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DEVELOPMENT OF A NAVIER-STOKES ALGORITHM 
FOR PARALLEL-PROCESSING SUPERCOMPUTERS 


ABSTRACT 

An explicit flow solver, applicable to the hierarchy of model equations rang- 
ing from Euler to full Navier-Stokes, is combined with several techniques 
designed to reduce computational expense. The computational domain consists 
of local grid refinements embedded in a global coarse mesh, where the locations 
of these refinements are defined by the physics of the flow. Flow characteristics 
are also used to determine which set of model equations is appropriate for solu- 
tion in each region, thereby reducing not only the number of grid points at which 
the solution must be obtained, but also the computational effort required to get 
that solution. Acceleration to steady-state is achieved by applying multigrid on 
each of the subgrids, regardless of the particular model equations being solved. 
Since each of these components is explicit, advantage can readily be taken of the 
vector- and parallel-processing capabilities of machines such as the Cray X-MP 
and Cray-2. 
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1. Introduction 


1.1 Motivation 

The objective of this research is to develop and explore an algorithmic stra- 
tegy for efficient simulation of complex aerodynamic flows on parallel-processing 
supercomputers. Minimizing flow field solution time can be accomplished by 
increasing both the efficiency of numerical algorithms and their rate of computa- 
tion. Although several researchers have computed flows over complete aircraft 
configurations [1-3], the routine use of this capability as a design tool is beyond 
the reach of available processing power. Likewise, progress is being made in the 
simulation of internal flows, such as the flow through turbomachine stages [4]. 
However, simulation of an entire engine in which all parts are functioning would 
require orders-of-magnitude increases in computer speed and memory size. On a 
much different physical scale, calculation of turbulent flows using large eddy 
simulation with the Navier-Stokes equations requires such high resolution that 
computations on appropriate grids can currently only be done for very simple 
geometries. To apply the full Navier-Stokes equations to turbulent flows over 
complex geometries with no modeling of turbulence is, again, far beyond present 
capabilities. Figure 1.1 [5] summarizes these observations. 

There are several ingredients key to attacking these currently-intractable 
fluid dynamics problems. The first is the development of robust algorithms 
which require a minimal number of operations to obtain solutions. Secondly, 
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judicious distribution of grid points in the physical domain of the problem to 
optimize resolution and minimize cost is important. Finally, attaining the max- 
imum execution rate (and hence, efficiency) possible of the fastest computers 
currently available, while paying due consideration to architectural design evolu- 
tion, is crucial. 

The state-of-the-art in algorithm technology for compressible flow simulation 
includes both implicit and explicit solution procedures for the Reynolds-averaged 
Navier-Stokes equations; examples are the Beam- Warming implicit scheme or the 
explicit MacCormack or Runge-Kutta schemes with multigrid convergence 
acceleration. Additionally, multiblock or zonal methods are gaining in popular- 
ity, and more sophisticated artificial dissipation models, upwind schemes using 
Riemann solvers and total variation diminishing schemes are being increasingly 
used for higher accuracy in complex flow simulation. Emphasis is beginning to 
shift toward time-accurate three-dimensional simulations as computing power 
and storage capacity grows. 

Closely coupled to algorithm strategies are advances in grid generation tech- 
niques. The use of unstructured meshes, automated or adaptive grid refinements, 
localized grid-embedding schemes, and interactive grid generation procedures 
contribute to a reduction in both computational costs and human effort required 
to devise discretizations of complex flow field domains. Expert systems may even 
become the common means of grid definition in the future (6,7). 

Achieving the maximum performance of the most powerful computers avail- 
able enjoins the use of multiple processing units focused on a single solution. 
The need for using more than one processor for improvement in computational 
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capability is driven by the fact that the speed of the single CPU is approaching 
its technological limit [8]. The actual implementation of parallelism in computer 
architecture takes many forms, as will be discussed subsequently; whether many 
weak processors are better suited for CFD algorithms than are a few powerful 
ones is a subject of contemporary research. The level of maturity of software 
(such as parallel languages, autotasking compilers, and tools for multitasking 
analysis) has significant impact on the pace at which parallel algorithms are 
developed. The techniques of vectorization, after a decade of use, are fairly 
well-understood, while methods for multiprocessing are relatively undeveloped. 
Until mature tools to facilitate parallel code development become widely avail- 
able, significant attention must be paid to the details of programming advanced 
computers. 

The challenge of solving complex flow problems in a parallel computing 
environment forms the impetus for this work. With the ultimate goal of being 
able to numerically solve the Navier-Stokes equations for realistic three- 
dimensional configurations, the approach taken here entails the application of a 
robust flow solver, the creation of embedded grid refinements, the use of conver- 
gence acceleration, and adaptation of the solution procedure to a parallel- 
processing supercomputer. Optimization of each element in the solution strategy 
must be done with consideration of its effect on the others; that is, the absolutely 
best scalar algorithm may perform quite poorly on a particular advanced- 
architecture machine, and as such might not be the method of choice. Parallel 
computing requires a "detailed understanding of the interactions among large- 
scale applications, parallel algorithms, and the architectures of parallel-processing 
devices" [9j. Therefore, a synergistic approach is taken. 
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Due to the high level of contemporary interest in this approach, a review of 
recent developments is useful. The reader may note that much of the research 
reviewed here has been carried out in the not-to-distant past; this fact attests to 
the increasing importance of computational resources in aerodynamic simulation. 


1.2 Literature Review 

A survey of the modern literature in computational fluid dynamics, with par- 
ticular attention to the aforementioned subject areas, helps to form a sound basis 
for devising an efficient and robust solution procedure. A review of three- 
dimensional Euler and Navier-Stokes computations reveals a number of innova- 
tive, cost-conscious methods, as researchers strive to make the most of available 
resources. A representative subset of these important investigations is presented 
in section 1.2.1, together with a brief history of computational fluid dynamics 
methods. The convergence acceleration procedure employed in the present work, 
multigrid, which receives much attention from applied mathematicians as well as 
computational fluid dynamicists, is discussed in section 1.2.2. The contributions 
of the mathematicians to the understanding of and improvements to multigrid 
complement its more intuitive usage in CFD. Zonal schemes and grid embed- 
dings are a subset of the grid techniques used to further reduce simulation costs 
and improve the accuracy of flow results. Recent work in this area is detailed in 
section 1.2.3. Parallelism in computer architectures has provided the stimulus for 
a plethora of research activities in parallel algorithms and in mapping algorithms 
to machines and is the focus of the fourth section of the present chapter. 
Finally, several of the issues addressed in the current work have also been con- 
sidered by other investigators, and are mentioned where appropriate. 
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1.2.1 Three-Dimensional Euler and Navier-Stokes Computations 

With the increasing availability of computational resources such as the Cray- 
2, with its massive memory, and the X-MP, with its SSD secondary storage capa- 
bility, a limited number of simulations of three-dimensional Navier-Stokes flows 
about complex geometries have been performed. The first viscous flow simula- 
tion about a full aircraft configuration (a delta-wing experimental aircraft con- 
cept, the X24C-10D), using the MacCormack method, was reported by Shang 
and Scherr in 1985 [1]. Jameson, Baker, and Weatherill [2] shortly thereafter 
computed the inviscid flow about a transport configuration that included engine 
nacelles. Other complete aircraft computations have been carried out. Flores, 
Reznick, Holst, and Gundy [3] and Reznick and Flores [10] have used a zonal 
grid scheme to compute separated flow about an F-16 fighter-type aircraft. Yu, 
Kusunose, Chen, and Sommerfield [11] performed inviscid transonic simulations 
for a commercial transport. Chaussee, Rizk, and Buning [12] solved the parabol- 
ized Navier-Stokes equations for the flow around the space shuttle, and, more 
recently, compared the numerically generated flow field about the shuttle in 
ascent configuration with experimental data. Volpe, Siclari, and Jameson [13] 
have applied a new multigrid method to inviscid fighter-type configurations. 

Holst [14] has surveyed recent progress in three-dimensional applications. 
Work which lays the foundation for computations about complete aircraft 
configurations includes the investigation of flows about three-dimensional bodies. 
This is often the culmination of new algorithm development in two dimensions 
which is extended to three-dimensional simulations. Holst [14] presents estimated 
memory and execution time requirements for solving the Navier-Stokes equations 
at various levels of simplification for airfoils, wings, and complete aircraft 
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configurations using a computer capable of one gigaflop, or one billion floating 
point operations per second. While the inviscid flow over a complete aircraft 
would be possible to simulate in minutes on such a system, solutions to the 
Reynolds-averaged form of the equations would consume hours or even days of 
CPU time. Large eddy simulations are projected to take years, and the direct 
simulation of all scales of turbulence with the Navier-Stokes equations in this 
gigaflop machine would be virtually impractical (i.e., off the scale). It is clear 
from this that even orders of magnitude (versus merely incremental) increases in 
speed will not sate the appetite of the computational fluid dynamicist. 

Shankar and Chakravarthy [15] have presented results of inviscid cases for 
multiple three-dimensional aerodynamic bodies. Benek, Buning, and Steger [16] 
have solved the inviscid flow about a wing/body/tail configuration using a 
Chimera, or overset, grid scheme. Benek, Donegan, and Suhs [17] then extended 
this method to viscous flows for a three-dimensional multiple-body configuration. 
Holst, Gundy, Flores, Chaderjian, Kaynak, and Thomas [18] applied a zonal pro- 
cedure to the viscous flow over a wing with good results. In Ref. 19, Flores 
examines and enhances the convergence properties of the zonal scheme. 

Complex flow phenomena, such as those which develop over wings at high 
angles of attack, provide a strong test of any three-dimensional method. 
Obayashi and Fujii [20] have calculated viscous transonic flow over a swept wing 
using an LU factored scheme, and Fujii, van Dalsem, Schiff, and Steger [21] have 
subsequently used this scheme to compute vortex breakdown for a strake-delta 
wing. Other vortical flow calculations include that of a planar delta wing at high 
angle of attack [22], unsteady flow about a hemisphere-cylinder [23], the hor- 
seshoe vortex around a blunt edge wing on a flat plate [24,25], an inviscid flat 
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plate [26], and an inviscid cropped delta wing [27]. A survey of high angle of 
attack, vortex-dominated flow computations may be found in Newsome and Kan- 
dil [28]. 

A slightly more challenging CFD problem, which seems to attract fewer 
attempts at solution, is three-dimensional transonic internal flow simulation, 
especially with applications to turbomachinery and turboprop engines. 
Geometric constraints and the relative motion of blade rows increase the 
difficulty of numerical solution. An approach taken by Rai, using patched grids 
in relative motion, has yielded impressive results for rotor-stator interaction 
[4,29]. Chima [30] has solved a quasi- three-dimensional form of the governing 
equations with an explicit multigrid scheme for various applications including 
centrifugal impellers, radial diffusers, and other axial turbomachinery com- 
ponents. Adamciyk and Graham [31] have derived the "averaged-passage" form 
of the Navier-Stokes equations to facilitate simulation of the flow through a mul- 
tistage turboprop. 

Time-dependent three-dimensional Navier-Stokes flows about oscillating air- 
foils have been simulated by Nakamichi [32]. Van Dalsem [33] has simulated a 
jet in ground effect with a crossflow, representative of V/STOL operation, using 
a fortified Navier-Stokes scheme. 

Rotorcraft computations, by their nature, require solution of the unsteady 
form of the equations of motion. The forces resulting from the interaction of hel- 
icopter blades with vortices, in both hover and forward-flight conditions, and the 
aeroacoustic characteristics of the flow are features desired from a computation. 
Chen, McCroskey, and Ying [34] have simulated inviscid multiblade rotor flows 
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using patched grids. Srinivasan and McCroskey [35] investigated tip vortex 
rollup with a Navier-Stokes solution in quasi-steady conditions. Strawn, Purcell, 
and Caradonna [36] couple the full potential equations with an integral acoustics 
code to perform helicopter sound studies. 

An area which has lately seen renewed interest is hypersonics. The two pri- 
mary applications in this flow regime are hypersonic transport vehicles (such as 
the National AeroSpace Plane) and reentry vehicles. A hypersonic vehicle design 
cannot be tested at flight conditions using current wind tunnel capabilities; it 
must, therefore, rely heavily on computational aerodynamics. Li [37] has com- 
puted the hypersonic flow about conical aerobrake bodies, Bardina and Lombard 
[38] have simulated the flow about a reentry vehicle, and Stoufflet, Periaux, 
Fezoui, and Dervieux [39] have done computations for the Hermes space vehicle 
using finite elements. Dwoyer and Kumar [40] present an overview and summary 
of recent progress in the computational analysis of hypersonic airbreathing air- 
craft flow fields. White and Drummond [41] focus on scramjet applications of 
CFD. Shang and McMaster [42,43] have investigated the interaction of jets with 
hypersonic crossflows. Additional recent hypersonic computations are reported in 
Refs. 44, 45, and 46. 

A discussion of Euler and Navier-Stokes computations is incomplete without 
some mention of the numerical methods for CFD in an historical perspective. 
The evolution of computational fluid dynamics has been driven by the need to 
solve flow problems more complicated than those for which analytical solutions 
can be found. Numerical approximations to the partial differential equations 
which govern fluid motion have a much wider range of application than do exact 
solutions of simplified subsets of these equations. 
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In the early 1900s, Richardson (47] used finite difference operators to solve 
elliptic equations iteratively. Courant, Friedrichs, and Lewy [48] made a major 
contribution to the theoretical development of finite difference methods in 1928 
when they used finite differences to prove existence and uniqueness theorems for 
PDEs. In a paper describing the work, they state what is popularly referred to 
as the CFL condition for stability: that the finite difference domain of depen- 
dence must contain the continuum domain of dependence. 

In the 1940s, Southwell [49] developed a relaxation method which was per- 
formed by hand calculation. Emmons [50-52] obtained the first solutions with 
embedded shock waves by applying hand-relaxation procedures to transonic 
problems. 

With the introduction of the first electronic computers, interest in the solu- 
tion of transient problems (those governed by parabolic or hyperbolic PDEs) sup- 
planted the original emphasis on elliptic problems. Von Neumann [53] contri- 
buted to the stability analysis introduced by Courant, Friedrichs, and Lewy by 
establishing stability criteria, and he and Richtmyer [53,54] successfully used dis- 
sipative techniques to capture shocks. 

In 1954, Lax [55] examined the use of the conservation form of the governing 
partial differential equations. Controversy developed in the late 1950s with 
Morawetz’s claim [56-58] that shock-free flows about 2-D profiles are in general 
isolated, thus negatively impacting airfoil design. The consensus was, however, 
that in a sufficiently large neighborhood of these design conditions, wave drag for 
shock-free airfoils was negligible. 
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In 1970, Magnus and Yoshihara [59], following ideas put forth by Crocco [60], 
obtained solutions to the inviscid steady flow problem by solving the unsteady 
equations with steady boundary conditions (a purely hyperbolic problem), 
thereby avoiding the difficulties which arise in solving the mixed elliptic- 
hyperbolic equations governing steady flow. An alternative approach to this prob- 
lem was taken by Murman and Cole [61], who used type-dependent differencing 
with an iterative solution procedure to treat the dual character of the steady 
equations. 

In 1969, MacCormack [62] presented a two-step explicit Lax-Wendroff scheme 
to solve the time-dependent Navier-Stokes equations, which is now quite popular 
due to the fact that it has a low operations count and maps well onto many 
parallel and vector computer architectures. 

The Beam-Warming [63] implicit algorithm, introduced in the mid-1970s, 
eliminated the need to adhere to the CFL stability limitation, as is the case with 
explicit schemes. This and other implicit methods form the basis for many 
efficient implementations on scalar computers as well as vector machines. 

Around 1980, approximate Riemann solvers [64] began to be popularized as 
highly-accurate techniques for solving hyperbolic systems of equations. Con- 
currently, upwind schemes were developed for fluid dynamics applications, and 
the use of these two related approaches for the solution of flows with strong 
shocks has since steadily increased [65]. 

Total variation diminishing (TVD) schemes, developed by Harten [66] and 
Chakravarthy and Osher [67] in the early 1980s, generate second order accurate 
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solutions with sharp nonoscillatory approximations to shocks and contact discon- 
tinuities for hyperbolic conservation-law systems. 

Design applications generally lag research by about ten years. In the 1960s 
designers used panel methods for solving the linear inviscid equations for shock- 
free applications. The nonlinear potential equations were implemented as design 
tools in the 1970s, and presently the Euler formulations are gaining acceptance 
with the availability of high-speed computers. 

Fundamentals for the numerical solution of partial differential equations may 
be found in Richtmyer and Morton [68], Richtmyer [69], and Mitchell [70], while 
the applications of such methods to the equations governing fluid dynamics are 
emphasized in Roache [71], Anderson, Tannehill, and Pletcher [72], and Ames 
[73], among others. 

1.2.2 Multigrid Convergence Acceleration 

The use of multigrid (or multiple-grid) schemes to accelerate the convergence 
of numerical methods for solution of the time-dependent and steady-state forms 
of the Euler and Navier-Stokes equations began with Ni’s work in 1982 [74]. He 
used a series of coarse grids on which a distribution scheme was applied to 
accelerate a one-step Lax-Wendroff algorithm for steady-state inviscid problems. 
Johnson extended this scheme to the entire class of Lax-Wendroff algorithms [75] 
and to the Navier-Stokes equations as well [76]. Jespersen [77,78] and Jameson 
[79,80] also investigated the application of multigrid to the solution of the Euler 
equations. Jameson implemented multigrid using a multistage time-stepping 
scheme on all grids, with careful use of dissipation operators to ensure 
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convergence. Stubbs [81] applied Ni’s scheme to implicit methods, while Jameson 
and Yoon [82,83] extended the more classical form of multigrid [84] in the same 
way. Johnson [85] further modified Ni’s scheme to eliminate the computation 
and storage of the Jacobians with the flux-based scheme, Koeck and Chattot [86] 
extended multiple grids to inviscid three-dimensional flows, and Johnson and 
Swisshelm [87] applied multigrid to three-dimensional viscous flows. Jespersen 
[88] showed that multigrid can be used to accelerate time-dependent solutions. 
More general information about multigrid may be found in Refs. 85 and 89-93. 

Hemker and Spekreijse [94,95] apply multigrid to the solution of the steady 
Euler equations. This approach differs from Ni’s scheme in that the multigrid is 
used as a fast solver for matrix inversions instead of as an accelerator of a time- 
stepping solution procedure. Mulder [96] and Shaw and Wesseling [97] have used 
similar techniques. 

Hall [98] has used a cell-vertex multigrid method for acceleration to the 
steady-state solution of the Euler equations using a Lax-Wendroff scheme, and 
Salmond [99] has extended this method to the three-dimensional case of a wing 
geometry. 

Peres et al. [100] and Lallemand and Dervieux [101] have applied multigrid to 
finite element methods in the solution of the two-dimensional Euler equations. 
Jameson and Mavriplis [102] and Mavriplis [103] have implemented multigrid 
with an unstructured mesh finite volume scheme with good results. 

Parallel multigrid schemes were introduced by Johnson, Swisshelm, and 
Kumar [104] for solving the Euler and Navier-Stokes equations, and by 
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Greenbaum [105] for application of classical multigrid. Both techniques are 
based on decoupling the coarse grids and distributing their computation across 
multiple processors. Frederickson and McBryan [106] have proposed a multigrid 
scheme suitable for massively-parallel SIMD systems. 

Several multigrid methods have been developed for use in embedded grid 
refinement solution procedures. Bassi, Grasso, and Savini [107] and Johnson, 
Swisshelm, Pryor and Ziebarth [108] used this technique to solve two-dimensional 
Navier-Stokes flows. Johnson and Swisshelm [109], and Swisshelm and Johnson 
[110] have extended this idea to three-dimensional in viscid and viscous flow com- 
putation. 

Uses of multigrid in adaptive grid schemes will be further discussed in the 
next section, including the work of Berger and Jameson [111], Thompson and 
Ferziger [112], Woodward and Colella [113], and Dannenhoffer and Baron [114]. 

1.2.3 Embedded Grids and Zonal Schemes 

Grid generation for complete configurations is one of the major pacing items 
in computational fluid dynamics [115]. One means of overcoming the difficulties 
presented by complex geometries is to partition a computational or physical 
domain into some reasonable collection of individually-gridded subregions. These 
types of techniques fall into two main categories: patched grid schemes and over- 
set or overlaid grid schemes. 'Patched” implies some kind of alignment between 
subregions at their shared boundaries, while 'overlaid" grids are generally topo- 
logically dissimilar with minimal commonality of grid points between subregions. 
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Patched grid systems may consist of local refinements to some global coarse 
grid, or two grids which are aligned at their shared boundary and move relative 
to one another. Usab [116] used a local mesh embedding technique to achieve 
two-dimensional Euler solutions with a scheme based on Ni’s multigrid conver- 
gence acceleration method. Holst and coworkers applied grid refinements in the 
solution of viscous transonic wing flows [18] and simulations for fighter-like 
configurations [3]. Swisshelm, Johnson, Pryor, and Ziebarth implemented a mul- 
tigrid embedded scheme for internal flow applications for two dimensions [108] 
and Swisshelm and Johnson extended it to three dimensions [110]. 

A variation on the aforementioned scheme is one in which the grid regions 
are allowed to move relative to one another. Rai has applied this idea with good 
results for two-dimensional rotor-stator interactions [117] and has extended it to 
the three-dimensional case including the hub and tip effects common in tur- 
bomachinery [4]. 

Local refinements may also be used in an adaptive sense; that is, as physical 
flow features evolve, grid points are added during the computation to enhance 
resolution in selected areas (for example, to resolve viscous effects or to sharpen 
discontinuities). Berger and Oliger [118] have applied an adaptive method to the 
solution of hyperbolic partial differential equations, and Berger and Jameson 
[111] have applied it to the Euler equations. Bell, Colella, Trangenstein, and 
Welcome [119] have solved a blast wave problem with this scheme. Thompson 
and Ferziger [112] presented another adaptive multigrid refinement scheme for 
the incompressible Navier-Stokes equations. Kallinderis and Baron [120,121] 
have extended Dannenhoffer and Baron’s work [114] to include automatic embed- 
dings for viscous and inviscid flows in two dimensions. Bassi, Grasso, and Savini 
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[107] calculate two-dimensional transonic Navier-Stokes flows using an adaptive 
method which couples a fine-grid Runge-Kutta scheme with multigrid. 

Nakahashi [122] presented a novel approach to the use of patched grid sys- 
tems in 1986 by using disjoint rectilinear finite-difference meshes around bodies 
which were connected with finite-element-type point distributions (i.e., triangula- 
tions). This enabled the resolution of viscous terms near surfaces for two- 
dimensional flows while providing the flexibility of a totally unstructured mesh. 
Nakahashi and Obayashi extended this composite grid scheme to three- 
dimensional problems including wing/nacelles [123] and multiple bodies [124]. 

Nakahashi and Deiwert developed a unique approach to grid adaptation 
based on a spring analogy [125,126], and adaptive finite element grids have been 
presented in Morgan [127] and Lohner et al. [128,129]. 

The most common implementation of overset meshes found in the literature 
is called the Chimera grid scheme [130], used to resolve the flow about multiple- 
body configurations. Benek, Steger, and Dougherty [131] used such a scheme for 
the two-dimensional Euler equations applied to a multielement airfoil, and 
Eberhardt and Baganoff [132] improved the scheme by treating the intergrid 
boundaries with the method of characteristics. Dougherty, Benek, and Steger 
[133] applied it to multiple bodies in relative motion (the store separation prob- 
lem), and Benek, Buning, and Steger [16] extended it to three-dimensional 
wing/body/tail and multiple-body cases, while Benek, Donegan, and Suhs [134] 
have computed 3-D viscous flows. 
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General information on grid generation for CFD may be found in Thompson 
[135,136]. 

1.2.4 Parallel Processing 

Increased computational capability, in addition to improved algorithm 
efficiency, is necessary for complex flow simulations. Very large memories and 
fast processing speeds are required. As the speed of component technology 
approaches physical limitations [8], advanced architectures become more impor- 
tant in the evolution of computers. 

In this section, recent work on those aspects of parallel processing which are 
relevant to CFD is reviewed. A comprehensive survey of all research in this area, 
although perhaps interesting to the reader, is beyond the scope of this thesis. 
For further information, one is referred to the following books. A good overview 
of parallel-processing concepts is found in Hockney and Jesshope [137]. The 
material they present is augmented by illustration of the principles as applied to 
specific architectures. A more hardware-oriented discourse is contained in Stone 
[138]. Here, the fundamentals of interconnection schemes and memory hierar- 
chies are discussed. The proceedings of any of the numerous annual conferences 
on this subject, such as the International Conference on Parallel Processing, may 
be enlightening. Dongarra and Duff [139] have complied a bibliography of 
advanced-architecture super- and superminicomputers, with summaries of impor- 
tant characteristics and performance. 

The main factor affecting multiprocessing, the amount of non-parallelizable 
work present in an algorithm, was first addressed in equation form by Amdahl 
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[140] , with what has become commonly referred to as Amdahl’s law. Worlton 

[141] , in a thoughtful treatise on the subject of parallel processing, has taken 
Amdahl’s relationship and expanded it to take into account inefficiencies that 
arise in multiprocessing due to synchronization overhead, load balancing, and 
task overhead. He further contributes to the reader’s enlightenment by examin- 
ing several limiting cases, thus drawing some useful conclusions about the realis- 
tic amount of performance to be gained by parallel processing. 

Gustafson [142], on the other hand, puts forth the hypothesis that instead of 
viewing parallel processing as limited by Amdahl’s law, the observation can be 
made that problem size is generally scaled up with increasing numbers of proces- 
sors. This "scaled speedup" concept assumes that the amount of parallel work in 
a problem will increase linearly with problem size, while the amount of sequential 
work remains more or less fixed. Hence, he claims that parallel efficiency should 
not suffer as much as Amdahl’s law would imply. 

Martin and Mueller- Wichards [143] present an overview of current practice in 
supercomputer performance evaluation and propose a set of metrics for the char- 
acterization of algorithms and architectures and their interactions, although they 
do not actually apply these principles. Entire conferences (for example, SIG- 
METRICS ’88 [144]) are currently devoted to the topic of performance evaluation 
for parallel computer systems. 

The first use of advanced architectures for CFD came in the mid-1970s with 
the ILLIAC IV parallel processor [145] and the STAR-100 vector processor [146]. 
The ILLLAC IV was an array of 64 processing elements, each with 2048 words of 
memory and a 240 nanosecond cycle time. The STAR- 100, precursor to the 
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Cyber 205, had a peak speed of 100 Mflops, and high memory bandwidth which 
was achieved through eight-way interleaving of the memory banks and pipelined 
transmission of data to the vector processor. 

Various researchers have adapted fluid dynamics applications to parallel- 
processing computers. Those using Cray’s multitasking library include Patel, 
Sturek, and Hiromoto [147], for a viscous axisymmetric ramjet problem, and 
Swisshelm and Johnson [110] and Swisshelm [148] for a three-dimensional 
Navier-Stokes zonal multigrid code. Mulac, Celestina, Adamczyk, Misegades, 
and Booth [149] implemented microtasking on a Cray X-MP to parallelize a tur- 
boprop flow solver. Misegades, Krause, and Booth [150] likewise microtask three 
applications codes: an unsteady viscous nozzle, a three-dimensional viscous wing 
code, and a three-dimensional Euler solver for a fighter configuration. Green- 
baum [105], McCormick and Quilan [151], and Johnson and Swisshelm [109] have 
investigated the implementation of multigrid methods on multiprocessors. 

Very impressive performance results on a massively-parallel distributed sys- 
tem have been obtained by Gustafson, Montry, and Benner [152] for a set of 
applications, including a flux-corrected-transport algorithm, on a 1024-processor 
NCube. Measured speedups of over 500, compared to uniprocessor execution 
times, have resulted in greater optimism regarding the use of massively-parallel 
systems for real codes. Included in this work is a thorough analysis of important 
parallel-processing issues. 

Others have also successfully used distributed-memory parallel systems for 
CFD applications. Catherasoo [153] has implemented the vortex method and 
Bassett and Catherasoo [154] have implemented a finite-volume Runga-Kutta 
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scheme on an Ametek 2010 hypercube. Erlebacher, Bokhari, and Hussaini [155] 
simulate compressible transition using a Flex/32. Fatoohi and Grosch use both a 
Goodyear MPP and the Flex/32 to solve the Cauchey-Riemann equations [156], 
to test a parallel alternating-direction implicit (ADI) scheme [157], and to simu- 
late two-dimensional incompressible Navier-Stokes flow [158]. Naik and Ta’asan 
[159,160] also compute two-dimensional incompressible flow using multigrid on a 
hyper cube. 

Murman, Modiano, Haimes, and Giles [161] and Modiano [162] have exam- 
ined the performance of isolated CFD code components on several machines, 
including the Alliant and the Intel hypercube. Asserting the belief that it is 
important to optimize the most common CFD loops, they parallelize the pressure 
computation, the flux summation, and the tridiagonal inversions used for implicit 
smoothing in convergence acceleration. 

Recently, Levit and Jespersen [163] have implemented a two-dimensional 
compressible Navier-Stokes code (for shear flow computations) on a Connection 
Machine. This massively-parallel system consists of more than sixteen thousand 
bit-serial processors connected in a hypercube topology. The small amount of 
local memory forces grid points to be distributed such that each node contains 
only a few points. Their implementation of ARC2D requires replacement of the 
Thomas algorithm for a tridiagonal matrix inversion with odd-even elimination 
(cyclic reduction) to attain efficient parallel execution. Frederickson and 
McBryan [106,164] have also achieved high processing rates for a multigrid solver 
on the Connection Machine. 
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Several surveys of recent progress in algorithms and applications codes for 
scientific computing have been written. Ortega and Voight [165] simply collect 
all vector and parallel algorithm work in bibliographic form. Johnson [9,166] has 
gathered actual performance data for parallel scientific applications and has 
attempted to filter it and categorize it in terms of shared and distributed memory 
systems. He too discusses fundamental concepts relating to Amdahl’s law. 

Some software tools have been developed to aid parallel algorithm and code 
development. These include Fortran language extensions such as CoFortran [167] 
and SCHEDULE [168]. Graphical analysis tools to facilitate debugging multi- 
tasked code have been created by Dongarra and Sorensen [168], Seager et al. 
[169] and others [170]. 

Research into parallel languages is being conducted. Some examples dis- 
cussed at a recent conference include Ada, Haskell, LGDF-2 (large grain data 
flow), Sisal, and Unity [171]. 


1.3 Scope 

The present research entails testing the feasibility of several computational 
approaches and the suitability of their combinations. Though not an exhaustive 
investigation, a path was chosen that, in the author’s experience, would yield the 
most useful and promising results. Since explicit finite-difference schemes are 
quite easy to parallelize, methods of this type are used almost exclusively. How- 
ever, while efficient multiprocessing is a key consideration in the derivation of a 
method, algorithmic efficiency is no less important. Thus, the very robust, yet 
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slowly converging, basic scheme used here is made much faster with the applica- 
tion of multigrid. Awareness of the need to conserve both operations and storage 
is manifested in the introduction of the zonal equation and embedded refinement 
aspects of the algorithm. The need for an accurate, stable scheme is self-evident. 
Hence, the intent is to demonstrate the feasibility of a three-dimensional solution 
procedure which will continue to evolve - as enhancements are made to algo- 
rithms (e.g., TVD schemes), multigrid techniques, and adaptive grid schemes, 
and as parallel computer architectures are refined and become larger and more 
powerful. 

A set of model problems for two- and three-dimensional testing was chosen to 
provide some flow complexity (internal problems) without requiring extensive 
geometry definition or gridding expertise. The emphasis is not fluid phenomena; 
instead it is on the algorithm - architecture interaction for a three-dimensional 
flow simulation procedure which is applicable to complex configurations. 

Preliminary testing of each component of the algorithm is carried out in 2D, 
extended to 3D, and then integrated into the 3D solver. Flow results are 
presented, but the main focus is on performance, i.e., the goal is that the flow 
results remain invariant while the simulation time is reduced by various tech- 
niques employed in the procedure. 

This work is presented in the following chapters. First, the equations of 
motion, which represent the mathematical problem to be solved, will be dis- 
cussed. The solution scheme, including the basic flow solver, multigrid conver- 
gence acceleration, embedded grid refinements, and the zonal application of the 
equations, and the synthesir of these elements into a unified algorithm is then 
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described. Next, topics associated with parallel processing and supercomputing 
are mentioned, with both algorithmic or software considerations and hardware 
architectures analyzed. Both aerodynamic flow results for the model problem 
and performance of the algorithm and its parallel implementation are given. 
Conclusions are then drawn, and, finally, a discussion of future directions for this 
research is presented. 



2. Equations of Motion 


While solution to the full viscous form of the equations of motion is ulti- 
mately desired, computational capabilities limit the scale of resolution currently 
possible for three-dimensional flow. The equations used to model aerodynamic 
flows of interest are often simplified relative to the full Navier-Stokes equations 
which contain all scales of turbulence. An examination of the actual physics in 
different regions of a solution domain can reveal that solution to the most com- 
plex model equations is not always necessary to capture the essential flow 
features. This motivates the development of the current flow solver which incor- 
porates several levels of modeling for a single simulation. 

Three forms of the equations of motion governing fluid flow in the continuum 
regime are presented in this chapter. The Reynolds-averaged form of the 
unsteady Navier-Stokes equations, their thin-layer version, and the Euler equa- 
tions are each defined and discussed. 


2.1 Euler Equations 

The Euler equations define the motion of a inviscid, nonconducting fluid, 
and may be written in conservation form and Cartesian coordinates as 


9 / + f, + + b, = 0 
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where the vector of conserved quantities, q, and the flux vectors, /, g, and h, are 
defined as: 
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The subscripts t, x, y, and z represent partial derivatives in time and space, and 
the variables p, u, u, w, p and E are density, velocity in the x-, y-, and z- 
directions, pressure, and total energy per unit volume, respectively. It has been 
shown by Viviand [172] that these equations maintain their strong conservation 
form under arbitrary transformation of coordinates. Their generalized coordinate 
form, which is used in the present research, is presented in Appendix A. 


The Euler system consists of five equations which represent conservation of 
mass, momentum, and energy for inviscid flow. The mass conservation, or con- 
tinuity, equation represents a balance between the change in mass contained in a 
volume, fixed in space and time, with the mass flux into and out of that volume. 
Likewise, the three momentum equations represent the same relationship for x-, 
y-, and z-momentum. The change in total energy in a fixed volume plus the 
total energy flux must similarly be balanced by the work done by external forces, 
internal stresses, and heat flux into a volume [72]. 

The Euler equations in their unsteady compressible form are classified as 
hyperbolic. Hyperbolic systems may be numerically solved by time marching, or 
evolution, methods. In their steady form, these equations are classified elliptic- 
hyperbolic for subsonic flow and hyperbolic for supersonic flow. The system in 
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this form is difficult to solve, because elliptic equations require a different numeri- 
cal approach than hyperbolic equations. Hence much of the success in computing 
steady solutions to subsonic and transonic flow problems has come from solution 
of the unsteady system. For incompressible flow, the unsteady equations are 
elliptic-hyperbolic, while in steady form they are elliptic. In contrast with the 
compressible equations, the steady incompressible system is relatively easy to 
solve with an iterative (or possibly semi-direct) elliptic solver, while the mixed- 
type unsteady system is often attacked by using an evolution scheme which 
solves an elliptic system at each time step, i.e., embedded in the hyperbolic 
solver. 

Additional equations are necessary for closure of the Euler system of equa- 
tions. Specific internal energy, e, may be related to the pressure and density by 
assuming a calorically-perfect gas 

P = (7 ~ 1)P« 

where y denotes the ratio of specific heats and the total energy per unit volume 
is given by 


E = p 


, 1 , 2 , 2 , 2 \ 
e + (u + t> + tv ) 


to complete the set of equations. 
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Auxiliary conditions are necessary to complete the specification of an Euler 
problem in the presence of flow field boundaries. At surfaces, the flow is required 
to be tangent to the boundary. At arbitrary flow boundaries, such as inlets, 
exits, or at far-field boundaries, the mathematical properties of the partial 
differential equations are examined to determine the number of conditions to be 
specified. The eigenvalues of the coefficient matrices of the partial differential 
equations are used to determine the characteristic directions. At arbitrary flow 
boundaries, one condition must be specified corresponding to each incoming 
characteristic. For the two-dimensional Euler equations, the characteristics are 
v/u, v/u, u + e, and u - e. At a flow inlet, subsonic flow (u less than c) requires 
three conditions and supersonic flow (u greater than e) requires four. At an exit 
or outflow boundary, one condition for subsonic flow is needed, while zero condi- 
tions for supersonic flow are necessary for a mathematically correct problem. As 
the initial condition, the five independent variables need to be specified over the 
entire flow domain. 


2.2 Navier-Stokes Equations 

The three-dimensional Reynolds-averaged Navier-Stokes equations may be 
written in conservation-law form as 


It + F s + G y + = ° 


where subscripts t, x, y, and z again represent partial derivatives, F, G, and H 
are given by 
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F = / - Re l d, G = g - Re l r, H = h - Re l », 


and the vectors q, f, g, h are as previously defined, while d, r, and a are defined 
as: 
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The terms in the viscous flux vectors represent 

T « = X K + v y + w z) + 2 V- U a T «» = V = + u ,) 

T w = X(u 2 + v y + W t ) + 2\lv v T „ = T « = M-K + wj 

T « = X K + + w z) + 2 * W z T yz * T «y = + %) 
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3, = yK Pr~ 1 e t + ut s + vi v + un u 

Note that the subscripts to t do not represent partial derivatives. The total 
energy per unit volume is defined as before, as is the specific internal energy, e, 
from the perfect gas law. The coefficient of thermal conductivity, k, and the 
viscosity coefficients, X and pi, are assumed to be functions only of temperature. 
Furthermore, X is expressed in terms of the dynamic viscosity, jl, by invoking 
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Stokes’ assumption of zero bulk viscosity 

2 

k = ~ “ |t 
3 

Re and Pr denote the Reynolds and Prandtl numbers, respectively. 

The following assumptions are made in the derivation of the Navier-Stokes 
equations: 1) the fluid is a continuum, 2) the fluid is Newtonian (stresses are 
dependent only on first derivatives of velocity and the fluid is isotropic), 3) the 
fluid has zero bulk viscosity, and 4) heat conduction behaves according to the 
Fourier law, i.e., is directly proportional to the first derivative of the tempera- 
ture. 

Additionally, the Navier-Stokes equations are used here in Reynolds-averaged 
form. By averaging the equations over some time interval, which is long relative 
to turbulent eddy fluctuations yet short relative to the macroscopic changes in 
the flow field, additional terms, often referred to as Reynolds’ stresses, result. 
The original quantities in the partial differential equation system are transformed 
to become the mean flow variables. The turbulent shear stresses, representing in 
some sense a time-averaged transport of momentum and energy [173], must be 
modeled when the Reynolds-averaged Navier-Stokes equations are used to predict 
turbulent flow. Following the Boussinesq [174] assumption of an apparent eddy 
viscosity, the present research uses the two-layer algebraic eddy viscosity tur- 
bulence model of Baldwin and Lomax [175], which is patterned after that of 
Cebeci [176]. No additional partial differential equations are introduced into the 
system by this model. Information from the mean flow variables is used to alge- 
braically compute p- T and k^, which together with p. and k form the turbulent 
viscosity and heat conduction coefficients. 
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The Reynolds-averaged Navier-Stokes equations may be used to simulate 
flows with separation, and to determine buffeting and total drag with a 
sufficiently sophisticated turbulence model [173]. A closer approximation to the 
full Navier-Stokes equations is achieved with an approach known as large eddy 
simulation, where large-scale turbulent motion is solved numerically from first 
principles and only the small, dissipative eddies are modeled [72], Large eddy 
simulation requires several orders of magnitude more computing power than the 
simulation of the Reynolds-averaged Navier-Stokes equations and still must have 
a sub-grid-scale turbulence model. This model can be eliminated by performing 
direct simulation of all scales of turbulent motion. 

Computational requirements (in terms of memory and speed) for these levels 
of simulation are presented in Figure 1.1 [5]. The bold line represents what 
resources are currently available. A performance improvement of 10 times will 
be necessary for large eddy simulations of realistic three-dimensional problems, 
and a 10* -fold increase will be required for simulations which directly resolve all 
turbulence scales. 

The Navier-Stokes equations may be mathematically classified as follows: In 
their compressible form, the unsteady equations are hyperbolic-parabolic and the 
steady equations are elliptic-hyperbolic. The incompressible equations are 
elliptic-parabolic in unsteady form and elliptic in steady form. As previously 
noted, partial differential equations which change type along a priori unknown 
boundaries in the flow field are more difficult to solve because of the need for 
type-dependent differencing. The unsteady compressible Navier-Stokes equations 
are solvable by the same methods as the time-dependent compressible Euler 
equations. 
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For a solution to these equations to be properly specified for some bounded 
domain, additional mathematical conditions must be prescribed. Initial values of 
the conserved variables (of which there are five in 3-D) must be specified at all 
points in the field. At flow boundaries, the number of conditions specified is 
determined by analysis of the characteristics at those boundaries (see Section 
2.1). At solid boundaries, heat flux or temperature is fixed, and velocity normal 
and tangent to the boundary is specified to be zero - the impermeable and no-slip 
wall conditions. At arbitrary flow boundaries, the number of incoming charac- 
teristics determines the number of variables to be fixed. As for the Euler equa- 
tions, subsonic flow requires specification of three variables at the inlet and one 
at the outlet, while supersonic inflow and outflow require four at the inlet and 
zero at the exit. 

The thin-layer form of the Navier-Stokes equations can be written in Carte- 
sian coordinates as 


Q t + f x + 9 y + = Re 1 S f 


where q, f, g, and h are defined as for the Euler and full Navier-Stokes equations, 
and S is 
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The generalized coordinate form of S is found in Appendix A. 
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In the thin-layer equations, the viscous terms with derivatives parallel to 
body surfaces are neglected, while all other terms are retained. This is justified 
by an examination of what is really being computed on a grid that is highly 
stretched normal to a surface [175j. Gradients of flow quantities are rapidly 
changing in the boundary layer in the direction normal to the wall. A highly- 
clustered grid spacing can be sufficient to resolve these viscous terms. Coarse 
spacing between points in lines or planes parallel to a solid boundary will not 
resolve the flow gradients in those directions which are generally of small magni- 
tude and thus may be safely neglected. The thin-layer equations have the same 
mathematical type as the full Navier-Stokes equations, and are therefore solvable 
by the same methods. 

The systems of partial differential equations presented in this chapter may be 
approximated by a variety of methods; in this work, finite difference schemes 
applied on a structured, rectilinear grid (or hexahedral in 3-D) are used. The 
Euler, Reynolds-averaged Navier-Stokes, and thin-layer forms of these equations 
of motion are solved where appropriate for the physics of a particular flow region 
in the computational domain. In the next chapter, the details of the numerical 
flow solver are presented. 


3. Numerical Solution Procedure 


In order to numerically solve the three-dimensional Navier-Stokes equations 
efficiently for complex flows, algorithms must be developed that minimize compu- 
tational work, yet take maximum advantage of the architectural features of 
currently available and future generation supercomputers. In light of the increas- 
ing reliance on multiple processors to gain computational speed, these algorithms 
should, therefore, be highly parallelizable. With this in mind, the scheme 
described in this chapter has been created from a combination of numerical solu- 
tion techniques which are explicit, but that, when used in conjunction with each 
other, result in a fast and efficient means of obtaining solutions to the equations 
of motion. 

The building blocks of the algorithm are presented in the following sections. 
The basic fine grid flow solver, an explicit, two-step predictor - corrector scheme, 
is described, with details such as flow boundary treatment and the implementa- 
tion of artificial dissipation. A multigrid method is applied to accelerate the con- 
vergence to a steady-state solution. The problem of generating appropriate grids 
for complex flows is addressed through the application of embedded grid 
refinements. Computational work is further reduced by creating a zonal scheme 
in which the different levels of governing equations, in the hierarchy from Euler 
to Navier-Stokes, are applied to different regions of the field. Also discussed are 
the implementation of intergrid boundary conditions. The assemblage of these 
elements into a unified solution algorithm concludes the chapter. 
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3.1 Fine-Grid Flow Solver 

MacCormack’s explicit predictor - corrector scheme [62] was chosen as the 
basic flow solver for the present study. It is a widely-used, well-understood, sim- 
ple and robust two-step Lax-Wendroff algorithm. The intention is not to restrict 
the solution procedure unnecessarily by this choice, however, and it is conjec- 
tured that other methods, such as Runge-Kutta [177] or Beam-Warming [63], 
could be substituted in its place. Stubbs [81] has shown that multigrid may be 
applied to an implicit scheme, and Jespersen [88] has used it to accelerate time- 
dependent calculations. Note that either an implicit or an explicit method could 
be used as the basic solver, but that explicit schemes are generally more amen- 
able to parallel-processing. 

The finite-difference equations, second order accurate in both space and time, 
may be written as 
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where: 


H,j,k = [«(* + A 0 ~ «(0 ] i,j,t 

9 »,/,* ~ 9,,y f * + A ?«,>,* 


, /T, y>ik = 


The vectors q, F, G, and H have been defined in Chapter 2. Here, t, x, y, and z 
represent time and the three spatial dimensions, respectively. The superscript n 
denotes the discretized time level, where t = nAt. Likewise, the subscripts i, j, 
and k determine location in the discretized domain according to i = iA x, y = 
jAy, and z = kAz, where Ax, Ay, and A z are the grid spacings in the three coor- 
dinate directions. Illustrations of the two- and three-dimensional stencils formed 
by the scheme are shown in Figure 3.1. 


The MacCormack algorithm as represented by Eq. 3.1 is in "forward predictor 
- backward corrector" form. This means that the spatial derivatives of the flux 
vectors F, G, and H are forward-differenced in the predictor step and backward- 
differenced in the corrector step. The order of these two steps, as well as the 
direction of the discretization of the three spatial derivatives, may be alternated 
to prevent asymmetry arising from one-sided differencing, although this was not 
found to be necessary for the present applications. In order to maintain second 
order accuracy, the derivatives of the viscous terms in F, G, and H must be care- 
fully discretized [72]. The x-derivatives contained in F are differenced in the 
opposite direction of F (i.e., backward in the predictor and forward in the 
corrector in the current implementation), while the y- and ^-derivatives in F are 
approximated with central differencing. Likewise, y- and z-derivatives contained 
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in G and H, respectively, are differenced opposite to G* and H g , while central 
differencing is applied to the cross-derivative terms. 

The MacCormack scheme can be derived from the Taylor series expansion 

A* 2 

g(t + At) = q[t) + Atq t + q tt + 

2 

keeping the appropriate terms to insure second order accuracy in time. Spatial 
second order accuracy is attained with the appropriate differencing of the deriva- 
tives in x, y, and z. The complete derivation is contained in Appendix B. 

The flow solver is stable only for time steps (At) less than the limit imposed 
by the Courant-Friedrichs-Lewy (CFL) condition [178]. The ratio of time step 
size to mesh spacing is restricted in such a way that the domain of dependence of 
a point in the difference scheme must contain all points of its domain of depen- 
dence in the differential equation [179]. For the three-dimensional case, this limit 
may be expressed as 
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The derivation and generalized coordinate form of this equation are contained in 
Appendix C. The variable a represents the local speed of sound, and u, v, and to 
are the three components of velocity. The viscous time step limit is approxi- 
mated by neglecting the convection terms in the Navier-Stokes equations and 
performing a von Neuman a stability analysis on the linearized equations [180]. 
This limit may be written as 
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The variables |x, 7, and p represent dynamic viscosity, the ratio of specific heats, 
and density. 

Once the time step limit has been determined at each grid point as the 
minimum result from the two equations above, the following means of advancing 
the solution are available. If maintaining time accuracy at each iteration is 
desired, a global time step equal to the minimum At in the field is used at all 
points. When a steady state solution to the unsteady equations is needed, the 
requirement for time accuracy is relaxed, and the local time step at each grid 
location may be used. This second technique removes the stiffness in the system 
introduced by the stretched mesh and accelerates the convergence to a steady 
state solution because larger time steps may be taken in areas of the mesh where 
grid spacing is large. Local time stepping is used to obtain all flow results 
presented in this work. 

For transonic flow computations, artificial damping is added as a fractional 
step to the fine-grid solver so that shock discontinuities may be properly cap- 
tured. The additional terms introduced into the equations are of second order 
globally but are locally first order where the density gradient is large. Numeri- 
cally, they have the form: 


Aq D = Atv D [ Ax 2 { I p, I q x ) t + A !/*( I p v I q p \ + A z 2 { I p, I ?,), ] 
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The partial differential equation system is thus modified to be 

q t + F x + G v + H g + v D [ Ax 2 ( I p x I 0 Z ) X + Ay 2 ( I p y I 7 y )„ + Az 2 ( I p f I q g ) g ] = 0 

The coefficient v D is order 1. This equation is only applied during the computa- 
tion of supercritical flows. Its generalized coordinate form is presented in Appen- 
dix D. 

On very fine grids, such as those generated for the three-dimensional viscous 
flow applications, additional dissipation is necessary to stabilize MacCormack’s 
method. The very fine mesh spacing admits very high frequency error modes 
which are not adequately damped by the simple artificial viscosity terms 
presented above. Terms representing fourth partial derivatives of the conserva- 
tion variables in each direction are therefore added to the solution. This has the 
effect of modifying the partial differential equation to have the form 

i, + r, + a, + h, + n + As, 3 w + a-u] = o 

The cubes of Ax, Ay, and A z give the artificial terms proper dimensionality. 
Through a von Neumann stability analysis [181], it has been shown that p, = 
1/16 gives zero amplification of the highest frequencies in the solution for a 
representative one-dimensional, linearized model problem. 

The above expression for the artificial viscosity transformed into generalized 
coordinates may be found in Appendix D. It is applied as another fractional step 
after the MacCormack scheme, sweeping in each of the three spatial directions. 
The coefficient p. often needs to be tuned for grids with different mesh spacing, 
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since the model problem analysis gives an optimum value of p, only for a very 
restricted, one-dimensional case, and does not provide a good general guideline 
for multidimensional cases. This tuning is achieved through computational 
experimentation in the present research. 

As noted in Chapter 2, along viscous solid wall boundaries the no-slip condi- 
tion is applied, temperature or heat flux specified, and pressure determined from 
the normal momentum relation. This last equation, formed by combining the 
three transformed momentum equations, is written in generalized coordinate 
form as 

- pU(i,u f + + i t w t ) - pK(C,% + {,«, + £,<%) 

- tt.t. + %i, + 5,t>{ + + V. + ’i.O, + tti + C + Op( 

and is solved implicitly for pressure in the streamwise ( £ ) direction along each 
wall. Any other values not specified by the mathematical boundary conditions 
are computed with an explicit method, that is, extrapolation from the interior of 
the domain. 

The surface pressure boundary condition, as shown above, is the only non- 
explicit calculation in the current algorithm. By solving the pressure relation 
implicitly, symmetry of the solution (for a symmetric flow) is preserved along the 
boundary. This condition can, in fact, be replaced by an equivalent explicit 
discretization of the partial differential equation, thus removing the recursive 
implicit computation, which inhibits the vectorization of this boundary condition. 
The two-dimensional code has been thus modified and tested, and is discussed in 
Chapter 7. 
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3.2 Multigrid Convergence Acceleration 

Although MacCormack’s method is efficient and quite robust for solving the 
time-dependent equations of motion, its convergence to steady-state solutions is 
slow due to its explicitness and CFL-limited time-step size. One means of 
increasing the rate of convergence is to apply multigrid in the manner first intro- 
duced by Ni [74] and modified and extended by Johnson [75,76]. This multigrid 
algorithm may be used for Euler, thin-layer, and full Navier-Stokes computa- 
tions. By solving Lax-Wendro£f-type equations for 8 q (the change in the solution 
at each time step) on a collection of grids coarser than the base grid, and com- 
municating the results back to the base grid in the form of an additional update 
to the MacCormack solution, convergence of the fine-grid scheme is accelerated. 

The actual mechanism by which multigrid improves the convergence to 
steady state of Euler and Navier-Stokes solution procedures has been explained 
in two ways. It is postulated by some [75,76] that the use of coarse grids facili- 
tates the propagation of transients out of the computational domain. Others 
[182] have shown that the use of multigrid serves to damp out lower frequency 
errors in the solution by virtue of the fact that the coarse grids see these modes 
more easily than the fine grids. One could also conjecture that some combina- 
tion of these two theories are the basis of the success of the multigrid technique, 
and investigation on this subject is ongoing [183]. 

3.2.1 Coarse-Grid Schemes 

The multigrid method was originally devised as a way to lower the number of 
operations (or in other wordr , the cost) needed to compute a steady-state solution 
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on a machine with serial, von Neumann-type architecture. The sequential algo- 
rithm uses a series of successively coarser grids, created from some base grid by 
deleting every other point in each coordinate direction, to propagate information 
from the fine-grid solver throughout the computational domain. The number of 
points in each direction on the base grid is required to be expressible a s n( 2?) + 1 
for p a natural number and n an integer greater than or equal to 2, where p is 
the number of grid coarsenings, and n is the number of coarsest-grid intervals 
[ 1841 - 

The coarse grid scheme is derived from a one-step Lax-Wendroff method 
written in a coarse grid setting. Such a scheme may be expressed as 

At 2 

Ho«r„ = + ~1u 

Recalling that q t - ~{F t + G y + H g ) , this may be rewritten as 

At 2 

= ~W. + C, + «.) ~ Y (f « + + H ‘ ] < 

Observing that 


A q = -A f(F, + G y + H,) 
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we obtain the equation 




= A q 


Ar 


\F t + + *.)i 


(3.2) 


Various coarse-grid schemes may now be derived according to the way in which 
the second term on the right hand side of Eq. 3.2 is treated. If we let it be 
approximated as 

-(F, + G p + H,\ = [A(F, + G, + #,)]_ + [fi(F, + G„ + 

+ [C(F, + G p + 

where A, B, and C are the Jacobian matrices, 

dF dG dH 

A — , B — , C — , 

dq dq dq 

we obtain the class of Jacobian-based acceleration schemes, of which the method 
due to Ni [74] is a member. 

For example, a simple discretized Jacobian-based scheme for a coarse grid 
having double the spacing of some fine grid may be written as: 
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If this represented a single-grid Lax-Wendroff scheme, A q would be computed 
from 


^»+i, >+!,*+! + G y + 


For multigrid, instead of solving this equation by discretizing the flux balance on 
the coarse grid, we approximate A q with a restriction, R, of the most recently 
computed fine-grid correction 

* «(*«/».) 

The effect of this restricted fine-grid correction is then distributed according to 
Eq. 3.3 to obtain a coarse-grid correction. This correction, in turn, is prolonged 
to the fine grid, to become an additional update to the fine-grid solution. In two 
dimensions, Figure 3.2 contrasts the one-step scheme written on a fine grid with 
the coarse-grid scheme written on a grid with double the spacing. 

We observe that in the basic integration scheme a correction at one grid 
point affects only its nearest neighbors, while in a A-level multigrid scheme the 
same correction affects all points up to 2 mesh spacings distant. Further- 
more, since the coarse-grid scheme simply propagates the effects of the fine-grid 
corrections, the fine-grid solution accuracy is maintained. 


The Jacobians, A, B, and C, are five-by-five solution-dependent matrices 
needed at each grid point and each time step and are, therefore, costly to com- 
pute and store. To increase the efficiency of the multigrid scheme, Johnson [85] 
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has developed a flux-based coarse-grid scheme which does not use the Jacobian 
matrices. If we let the right hand side of Eq. 3.2 be approximated by 

(F, + G f + H t \ V j- [(F, + G, + H,)"" - (F, + C ( + //,)"] 

Li I 

thus yielding the equation 

H. m , = - y [(F, + G t + - (F, + C, + ff,r] 

A 


where 


F" = F(g n ) , G n = G{q n ) , H n = 

F n + l = F{q n + A q) , G" +1 = (?(*" + A?) , H n + 1 = #(?" + A?) 

we obtain a class of flux- based coarse-grid schemes. The quantity A q is again 
approximated by a restriction of the fine-grid value of hq and second-order accu- 
rate spatial differencing is used. The computation and storage of the five-by-five 
Jacobians has been replaced by spatial differences of the flux vectors, thus reduc- 
ing the cost of the multigrid scheme. 

3.2.2 Information Flow 

The information flow of the sequential multigrid scheme is depicted in Figure 
3.3. First, a fine-grid correction, 8 q^, is computed. Then 8 q* is restricted to the 
next-coarser grid, where hq g is computed. The hq g correction is both restricted 
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to grid S and prolonged to grid 1, where it provides an additional update to the 
fine-grid solution. On grids S through N-l the procedure is analogous to that on 
grid 2. When hq N has been computed and prolonged to grid 1 to provide the 
update to the fine-grid solution, the next multiple-grid cycle is ready to begin. 
Injection is used as the restriction operator for the sequential coarse-grid scheme, 
and linear interpolation is used as the prolongation operator. 

The parallel coarse-grid algorithm [104,184], illustrated by Figure 3.4, 
removes the dependence of grids $ through N upon their immediate predecessors, 
and creates a scheme which is particularly suitable for implementation on 
MIMD-type computer architectures. In contrast to the sequential multigrid algo- 
rithm, 8 q 1 is now restricted to each of grids 2 through N. All of the coarse grids 
may then be updated simultaneously and independently of each other. Averag- 
ing is used in the restriction operator for grids S through N, while the prolonga- 
tion operator remains linear interpolation. Testing has demonstrated that the 
convergence rate of the parallel coarse-grid algorithm is nominally the same as 
that of the sequential scheme. 

3.2.3 Implementation Details 

Dissipative effects have a local character and their influence need not be 
taken into account in the construction of coarse-grid schemes. Rather, it is the 
convective terms, with their global character, that are the key element in coarse- 
grid propagation. Hence, coarse-grid schemes for viscous flow computations may 
be formulated on the basis of the in viscid equations of motion [76]. Such a 
scheme leads to a multiple-grid convergence acceleration procedure which is 
independent of the nature of the dissipative terms retained in the viscous model 
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equations. In other words, a coarse-grid scheme based on the Euler equations 
may be employed, without modification, to accelerate the convergence of viscous 
flow computations based on the Navier-Stokes equations, the thin-layer equa- 
tions, or any other viscous model equations which contain the full inviscid Euler 
equations. The flux-based coarse-grid scheme may be written in convective form 
as follows: 


Hearn = ~ [(/* + + \)" + 1 ~ (/« + 9y + \)“ ] 

mt 


This version of the coarse-grid scheme is used in the present computational algo- 
rithm. 

Boundary conditions are enforced only on the fine grid. This has the advan- 
tage of decoupling the coarse-grid scheme from both the physical and numerical 
nature of these boundary conditions. That is to say: the coarse-grid scheme 
always sees a Dirichlet problem. Likewise, any numerical damping terms which 
may be necessary are also only applied on the fine grid. This enhances the 
modularity of the coarse-grid scheme. 


3.3 Embedded Grid Refinements 

In order to properly capture the physics of the flow field with a discretized 
solution procedure and a limited number of grid points, these points must be 
clustered in certain regions. Although the Euler equations, which include only 
inviscid terms, may be sufficiently resolved on a fairly coarse mesh, the full 



47 


Navier-Stokes equations require very fine mesh spacing to resolve the viscous 
terms (which are gradients of velocity gradients) near surfaces. 

One way to cluster grid points in specific regions of the flow is to use stretch- 
ing. Applying this technique to distribute grid points may be as simple as alge- 
braic or geometric calculations or as complicated as solving elliptic or hyperbolic 
partial differential equations. Although grid stretching is used to some extent in 
the present work, it has limitations and is not always sufficient for applications 
with complex geometries. Meshes which are highly stretched in one coordinate 
direction and relatively coarse in another may have cells with high aspect ratios, 
which can have a destabilizing effect on a numerical solution algorithm. Cluster- 
ing is difficult to achieve in the presence of complex domain boundaries, and con- 
centrating enough points for high resolution in one area may unnecessarily add 
points elsewhere in the field. 

Alternative methods for optimizing the grid point distribution in a complex 
flow field include overlaid or Chimera grids [133], zonal or patched grids [18,185], 
and the use of unstructured meshes [103]. A Chimera scheme has local meshes, 
each tailored to conform to one of the bodies in a multibody configuration, over- 
laid on some global mesh. This technique has been used successfully to resolve 
the flow about multi-component airfoils and to compute time-dependent store 
separation problems in two dimensions [133]. Three-dimensional applications 
include flow simulation about wing/tail/body and multiple-body configurations 
[16]. Zonal or patched grids cover a computational domain with only minimal 
overlap at interfaces between regions, such that the information transfer between 
grid zones is less complicated than with the Chimera scheme. Moving patched 
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grids have been used extensively by Rai (185] to simulate time-dependent rotor - 
stator interaction in two and three dimensions. Holst et al. [18] have used a 
tonal mesh arrangement, similar in some respects to the technique of this work, 
for an internal flow computation. Unstructured grids have been applied to suc- 
cessfully compute the in viscid flow about a commercial transport [2], and their 
accuracy as compared to that of structured meshes is currently being studied 
(186, 187]. 

We have chosen to explore the feasibility of introducing embedded 
refinements into a global (and relatively coarse) mesh, in the same manner as 
presented by Usab [116] or Kallinderis and Baron (120], as one method of create 
ing acceptable grids for complex geometries. The locations of the embeddings are 
defined a priori, drawing upon our experience with similar flow solutions. This 
process, of course, could be automated, and also could be used in an adaptive 
grid scheme (see, for instance, Berger [188]). In the current algorithm, each grid 
embedding is required to contain as a subset any underlying coarser grid points. 
In other words, refinements are formed by successive halving the grid spacing in 
each coordinate direction. This is illustrated in Figure 3.5 for a simple two- 
dimensional domain. 

Extensive testing has been performed on the two-dimensional test case [108] 
using the grid arrangement depicted in Figure 3.5. The model problem consists 
of a sting-mounted blade in a channel. A global coarse grid was defined, and the 
first embedding placed to facilitate shock capture in supercritical flow, while a 
further refinement was positioned along the viscous wall boundary for high reso- 
lution of the boundary layer. 
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A simple embedding arrangement has been created to test the zonal grid 
refinement scheme in three dimensions. A single refinement region is defined, 
which covers the area above the airfoil to the exit, and spans the entire 
transverse direction. In this problem, the intergrid boundary conditions, work 
reduction by grid point elimination, and the zonal application of the 
Euler/Navier-Stokes solver can be studied without the additional complexity of 
managing many refinement regions. This 3-D model problem is illustrated in 
Figure 3.6. 

The three-dimensional application which has motivated the development of 
this algorithm, a model of the geometry of a turbomachinery blade row, is illus- 
trated in Figure 3.7. Highest resolution of the flow solution is required along the 
blade surface and in the corners created by the juncture of the endwalls and the 
blade. Here, three-dimensional viscous effects dominate the flow, and the viscous 
terms in the equations of motion require high resolution and therefore a very fine 
grid (one which would be equivalent to a 129x33x33 global mesh). Figure 3.8 
shows the positioning of the embedded refinements for a typical complex 3-D 
computation. Figure 3.8a depicts the global coarse grid (33x9x9), and 3.86 and c 
show the two subgrids, each refinements of the global grid by factors of two and 
four in all dimensions, respectively. Figure 3.9 presents the assemblage of this 
set of grids. Further details about the test problems are contained in Chapter 5. 

The boundaries between the grid regions require particular attention. Injec- 
tion (i.e., merely using the underlying fine-grid information without any weight- 
ing) is used to transfer information from the fine grid to the coarser regions. 
However, for communication in the direction from coarse to fine, a more compli- 
cated transfer is necessary. T nterpolation is the simplest way to fill in the fine- 
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grid boundary points. To maintain conservation at these intergrid boundaries, 
however, the coarse-grid interior points which lie on the fine-grid boundary must 
be corrected by performing a flux balance of the surrounding cells. This tech- 
nique has been used with some success in two dimensions [116], but a derivation 
of the corresponding 3-D implementation shows that it requires an inordinate 
amount of additional floating point operations, and, although applied only at a 
small percentage of the total number of grid points, does not appear to be cost 
effective. A third possible treatment of these points is to overlap the fine and 
coarse regions by more than one row or plane, so that the actual boundaries lie 
farther in the interior of neighboring regions. In the present work, interpolation 
has been used to transfer information in the coarse to fine direction. 


3.4 Zonal Application of Flow Solver 

Given a set of grid embeddings such as described in the previous section, a 
procedure has been developed by which the appropriate flow equations are solved 
in each region of the computational domain. The flow physics and the coarseness 
of the grid in each zone provide the guidelines for determining which set of equa- 
tions are appropriate. 

The full Reynolds-averaged Navier-Stokes equations are needed to resolve 
three-dimensional viscous effects, such as those arising in corners where the blade 
and endwall meet. The thin-layer equations suffice in regions where diffusion 
processes parallel to a body surface may be neglected (assuming that the grid is 
body-conforming). In grid zones far from viscous boundaries, where the flow is 
inviscid, the Euler equations are applied. Computational costs decrease (because 


51 


fewer terms need to be computed) as one moves through this hierarchy from full 
viscous to inviscid. Additionally, convective terms can be sufficiently resolved on 
a coarser discretization than, for instance, diffusive terms in the boundary layer. 
These two considerations are combined to reduce computational expense substan- 
tially. 

MacCormack’s method is applicable to all zones of the flow field. Since the 
convective coarse-grid implementation is used, the multigrid scheme may be 
applied without modification to both viscous and inviscid flow zones. 

A natural coupling of the zonal scheme to the use of embedded fine grids 
exists. Since high resolution is required for the full Navier-Stokes equations, it 
makes sense to solve them only in the finest grid regions. Thin-layer equations, 
applicable to viscous wall boundary regions, are solved on the appropriately- 
refined grids near surfaces. The inviscid Euler equations are well-suited for 
regions of the flow field away from surface boundaries where no refinements have 
been made to the grid. 


3.5 Algorithm Assembly 

The flow field updating begins with mesh 1, the finest embedding. After one 
time step on mesh 1, mesh 2 is updated exterior to mesh 1 while convergence 
acceleration is applied at the mesh- 2 points interior to mesh 1. Next, mesh S 
(the global coarse grid) is updated exterior to mesh 2 while convergence accelera- 
tion is applied at the mesh- S points interior to mesh 2. Updating proceeds in this 
fashion until the global mesh has been advanced one time step. Then, 
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convergence acceleration is applied to coarsenings of the global grid. The 
appropriate equations are solved in each region, as described in section 3.4, while 
convergence acceleration uses only the inviscid terms. This cycle is repeated until 
the desired measure of convergence is satisfied. 

Figure 3.10 shows an example, for a typical 2-D implementation, of the com- 
putational steps and grid regions involved in one cycle through the algorithm. 
Part o shows the grid regions. Where grid points/lines are shown, either the 
MacCormack scheme or multigrid is applied: MacCormack only in Part a; Mac- 
Cormack interior to the bold line (intergrid boundary) and multigrid exterior to 
the bold line in Parts 6 and e; and multigrid only in Part d. 

The numerical solution procedure described in this chapter combines an 
explicit scheme with several execution- and convergence-acceleration techniques 
to yield a robust solver with wide applicability to viscous and inviscid internal 
flows. It is particularly well-suited to parallel-processing supercomputers. Exten- 
sion to flows with more complicated physics and geometries is expected to further 
demonstrate the efficiency and flexibility of the procedure. Automated adaptive 
gridding, better zonal interface boundary treatment, and further investigation of 
artificial dissipation operators would also contribute to improving the scheme, 
and some of these additions and modifications are planned. The results of com- 
putational experimentations using this algorithm are presented in Chapter 5. 
Next, parallel processing considerations are discussed. 



4. Parallel Processing 


In an effort to build faster, more powerful computational engines, computer 
architects are increasingly incorporating parallelism in their designs. This is 
necessitated by the current development stage of scalar or von Neumann-type 
computers, in which components are approaching their technological performance 
limits [8]. To effectively use multiprocessing devices for the numerical simulation 
of aerodynamic flows, the scientist must understand both the architecture of 
these machines and the methods by which their parallelism is optimally 
exploited. Unfortunately, the development of software tools such as compilers 
and languages lags the pace of hardware innovations, so that elegant, portable 
expression of parallel algorithms is not yet possible. 

In this chapter, parallel computer hardware and software will be analyzed, 
and the impact of each on algorithmic considerations for concurrent computing 
will be illustrated. Details about current-generation supercomputers and the pro- 
gramming techniques employed to exercise their full capabilities, such as vectori- 
zation and multitasking, will be discussed. Typical formulae for evaluating 
parallel-processing performance will be presented. In concluding this chapter, we 
will look to the future by examining both next-generation supercomputers as well 
as some present-day advanced-architecture superminicomputers, which may 
determine the direction large-scale computing will take in the next decade. 
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4.1 General Considerations 

Parallelism can be achieved at many levels within a computer system, some 
of which are transparent to the user while others require a nontrivial program- 
ming effort for their effective use. With a sound understanding of architectural 
issues, available software tools can be intelligently implemented, and algorithm 
strategies can be devised to maximize the performance of the supercomputer. 

4.1.1 Parallel Computer Architectures 

Computers may be classified architecturally according to a taxonomy 
developed by Flynn [189]. A purely scalar computer, in which instructions are 
issued sequentially and operate on single data elements, is a single-instruction 
stream, single-data stream (SISD) machine. Vector and array processors, which 
apply single instructions to vectors or arrays of data, are classified single- 
instruction, multiple-data (SIMD) computers. Systems consisting of multiple pro- 
cessors which may operate independently of one another are referred to as 
multiple-instruction, multiple-data (MIMD) machines. SIMD computers either 
use vector instructions to operate on large data sets in a manner similar to an 
assembly line (or pipeline), as on a Cyber 205, or they distribute data across a 
large array of less-powerful processors which then execute instructions in lock- 
step, as on an MPP or Connection Machine. On MIMD computers, separate pro- 
cessors may execute different instructions simultaneously. 

Parallelism in computer architectures may be present at a number of levels 
[137,190], which complicates the seemingly straightforward classification scheme 
set forth by Flynn. At the highest level in a multiprocessor system, different jobs 
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(i.e., entire programs) may be executing at the same time on the various proces- 
sors. In a multitasking environment, individual programs may use more than one 
processor at once. At a lower level, instructions may be partitioned over several 
functional units, or the functional units themselves may have parallel or pipe- 
lined steps. At the lowest level, bit manipulation can be performed in parallel. 
Computer designs generally incorporate some subset of these ideas in hardware. 
In this work, we are primarily concerned with taking advantage of parallelism at 
the individual program level (i.e., the level which the user sees), referred to as 
multitasking. 

Several categories of parallel processing computers, including both S1MD and 
MIMD types, are in evidence among currently-available hardware. Key charac- 
teristics of these systems are: the number and power of central processing units 
(CPUs), the location of memory, and the nature of the connection scheme 
between the CPUs and memory [166]. Computers generally have either a few 
powerful processors or many weak ones, since the extreme cost of building a sys- 
tem with many powerful processors is prohibitive. An example of the former 
type of system is the Cray X-MP/48, with four vectorizing CPUs, each of which 
is more powerful than Cray’s previous supercomputer, the Cray-1. A machine 
such as the Connection Machine is representative of the "many weak" type, hav- 
ing as many as 2 1S single-bit processors which all execute the same instruction. 

The main (or primary) memory in a multiprocessor computer may either be 
distributed among processors (a loosely-coupled system) or shared, with equal 
accessibility, by all processors (a tightly-coupled system). With either type of 
memory implementation, additional levels of storage are generally present in the 
machine. Registers and c iche memory are relatively small in size, closely 
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associated with individual processors, and have the highest access speeds. Typi- 
cally cache memory, with an access time on the order of one clock cycle, is 4 to 
20 times faster than main memory [138]. A shared main memory may be parti- 
tioned into banks to improve the rate of access to data. Retrieving a string of 
data from many different banks can be done faster than accessing it all from a 
single bank (where memory cycle or refresh time would limit the frequency at 
which each datum can be obtained). The largest main memories currently range 
in size up to two gigabytes. Slower, larger storage devices such as an SSD 
(Solid-State Storage Device) are used to increase the memory capacity of the 
machine by as much as four additional gigabytes. These various levels of 
memory hierarchy are illustrated in Figure 4.1. 

Interprocessor communication is an important feature of any multiprocessor 
system. Some connection scheme possibilities include busses and switching net- 
works such as full crossbars and hypercubes [138]. A bus has the simplest topol- 
ogy for linking processors to memory but has the highest potential for becoming 
a system bottleneck. Insufficient bandwidth can result in contention for the bus, 
reducing system performance. A full crossbar switch, on the other hand, minim- 
izes the possibilities for contention but becomes increasingly complex as the 
number of processors grows. The crossbar directly connects every processor with 
every memory module, so that contentions for the interconnect are essentially 
nonexistent. For large numbers of processors the full crossbar switch is prohibi- 
tively expensive. While busses and crossbars lie at opposite ends of the spectrum 
of interconnection schemes, switching networks, such as the butterfly or perfect 
shuffle, and multiprocessors linked together in a hypercube lie somewhere in 
between. Processors in a hypercube are directly connected only to a subset of the 
other processors, and access the rest via message-passing (or other forms of 
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routing) through intermediate processors. Figure 4.2 shows three of the most 
common interconnection schemes: the bus, the full crossbar switch, and the 

hypercube. 

Although the cost of multiprocessor systems grows linearly with the number 
of CPUs (assuming the relative cost of the interconnect to be negligible, a very 
conservative assumption for this comparison), the performance grows more 
slowly. For a massively-parallel system to be cost effective, problem and program 
parallelism must be fully exploited and efficiently mapped onto the architecture 
of the machine. The software tools that will enable such implementations are 
still in early stages of development and will be discussed in sections 4.2 and 4.3. 

4.1.2 Adapting Algorithms to Parallel Architectures 

In a discussion of parallel processing applied at the program level, the term 
process or task refers to a unit of computation that may be executed in parallel 
with or independently of one or more other sections of code. In general, these 
separately-running processes need to exchange information among themselves, 
which leads to synchronization concerns, communication management, and over- 
head costs. The effects of load balancing, granularity, overhead, and Amdahl’s 
law are all important factors affecting parallel-processing performance. 

To analyze the performance of a computer having two distinct speeds of exe- 
cution, Amdahl [140] developed the model 

*(/) = — 

(1 /) / R H + / / R>i 
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where the resulting system rate, R(f), is determined by the fraction of results (1- 
f) generated at the high rate, Rjj , and the remaining fraction, /, generated at the 
low rate, R L . For a multiple-processor system, the high rate is simply the 
number of processors, P, multiplied by the low rate. Amdahl’s law is then writ- 
ten 




( i -/)//> + / 

If the execution time, T, is defined as the number of results generated at a 
given execution rate, and speedup ( S p ) is defined as the ratio of execution time 
on a single processor ( TJ to that on the multiprocessor system ( T p ), then 

s m h. m b \ n _ i 

' T p R l (1 -f)/P + f 

From this relation, it can be seen that the fraction of residual sequential code, f, 
limits possible speedup as P gets large. This equation is illustrated graphically in 
Figure 4.3. 

Worlton [141] uses Amdahl’s law to examine the effects of overhead, granu- 
larity, and synchronization requirements. In his interpretation of the model, N, 
t, t g , and t Q represent, respectively, the number of tasks between synchronization 
points, average task execution time (granularity), and task overhead time due to 
parallel execution. The single-processor execution time is then 

T 1 = Nt 
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Assume, for simplicity, that the number of tasks is evenly divisible by the 
number of processors and the tasks are all equal in length (i.e., assuming a per- 
fect load balance), so that each task takes (t + tj time to execute, and t f time is 
required for synchronization. The total parallel execution time is then 

m + g 

p 

The speedup for this case is given by 

T i Nt 

S P 

T p t, + N(t + t 0 )/P 

The best possible speedup on P processors is obviously equal to P. (This 
excludes the case of superlinear speedup, which, while plausible for certain algo- 
rithms such as tree searching, does not appear to have applicability to the 
numerical solution of partial differential equations in fluid dynamics.) 

Efficiency is defined as 

1 

(t.P I Nt) + (1 + t, / I) 

so that one hundred percent efficiency is equivalent to the maximum speedup, P, 
attainable on P processors. 

If we neglect synchronization time to examine the relative effects of granular- 
ity and overhead size, the equation for efficiency can be rewritten 
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1 

E p = 

(1 + *. I 0 

so that the ratio of overhead to granularity must be very small for good 
efficiency, or, in other words, task granularity must be much larger than task 
overhead. 

If, on the other hand, we examine the effect of synchronization time while 
neglecting the task overhead, the equation becomes 

1 

£J = 

P ( Pt t / Nt + 1) 

so that synchronization overhead is minimized by increasing task length. 

Several conclusions can be drawn from the above model. The performance of 
a parallel-processing computer is constrained by the fraction of the work that 
must be done in sequential mode, which limits the attainable speedup as the 
number of processors is increased. This fraction of sequential code can result 
from the mathematical model or the algorithm [142]. Synchronization and task 
overhead effects can be reduced by increasing the granularity of tasks, while 
better load balancing is achieved by having the number of tasks be equally divisi- 
ble by the number of processors and all of equal length. 

There are several ways in which an algorithm may be partitioned. Functional 
decomposition of a problem is appropriate when physical processes are evolving 
on different length or time scales. An example of this is the parallel coarse-grid 
scheme, in which the various grid levels have been decoupled so that they may be 
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computed concurrently. Another functional decomposition can be achieved by 
distributing each equation of a system of partial differential equations to a 
different processor. Spatial decomposition is appropriate for flow solvers in which 
groups of points can be viewed as being independent of one another on a given 
time level. Line implicit and explicit schemes for solving partial differential equa- 
tions fall under this category. It may even be possible to formulate temporal or 
phase decomposition schemes where, as the solver finishes a time step on one part 
of the grid, the next time step could be started there, while the computation of 
the previous time step is being completed elsewhere in the field. These tech- 
niques may be combined to create an optimal partitioning scheme for a particu- 
lar parallel-processing architecture. 

Exploitation of the parallel-processing capabilities of a system requires either 
extensions to an existing language such as Fortran or development of an entirely 
new language. Merely relying on compiler sophistication is insufficient to harness 
the full power of the hardware for most applications. Although language exten- 
sions are necessary for the short term (and very common among computer ven- 
dors), their diversity on various systems makes it impossible to write transport- 
able parallel code. No consensus has been reached thus far regarding the form 
multitasking (or, for that matter, vectorization) should take. This is evident in 
some of the examples in the following two sections. 


4.2 Vectorization 

Vector computers achieve higher performance than scalar or sequential com- 
puters by pipelining identic? 1, yet independent, DO loop level operations on large 
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data sets. These data sets, when stored in some regular arrangement in memory, 
are referred to as vectors, and the pipelining technique is called vectorization. All 
current-generation supercomputers used in the present work have this capability. 

Vectorization has as its goal the production of at least one result per CPU 
clock cycle per vector pipeline. Since most operations take some multiple of 
clock cycles to complete, functional units are divided into single-cycle stages and 
operations performed in an assembly-line fashion to achieve this end. This also 
implies that data must be fetched from memory at the rate of at least one or two 
operands per clock cycle and be stored at the same rate. This is often accom- 
plished by partitioning the memory into phased banks, such that successive 
banks may be accessed one cycle apart for vector operations. 

Vectorization is performed at the DO loop level in Fortran. Loops with a reg- 
ular (i.e., constant) stride are most amenable to this technique. Some barriers to 
the vectorization include [137]: conditional and branch statements, recursive 
computations, nonlinear and indirect addressing, and subroutine calls within 
loops. In order for a compiler to automatically recognize vectorizable loops, 
moderate to extensive code restructuring is usually required, ranging from simple 
reordering of array indices or lines of Fortran, to implementation of explicit 
gather /scatter instructions to create the vectors, to algorithm modification to 
eliminate recursive computations. 

Vectorizing Fortran code for the Cyber 205 involves the following. First, 
array indices are ordered such that the dimension of the leftmost index is largest 
and the rightmost is smallest. Likewise, DO loops are arranged such that the 
innermost corresponds to the first (leftmost) index, etc. These loops can then be 
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vectorized by the 205 compiler if the data are stored contiguously, the loops have 
stride one, and none of the aforementioned barriers is present in the loop. For 
cases where the data are not stored contiguously, or the loops have a nonunit 
stride, explicit vector Fortran syntax must be employed to achieve vectorization 
[191]. (At this point the code loses its portability.) For loops computing values 
at a high percentage of the locations in a vector or array, the WHERE block con- 
struct is appropriate. A bit vector mask with length equal to that of the operand 
vectors is created, containing ones denoting locations where new values are 
desired and zeroes which mask out other locations. Computations are carried out 
for the entire vector, and then results are stored only in locations corresponding 
to the one bits in the masking vector. Vector intrinsic functions may be used to 
gather or compress data into contiguous memory locations when relatively few of 
the vector or array values are to be modified. After the vector operation, the 
results are either scattered or expanded and merged into appropriate storage 
locations. 

The techniques used to facilitate automatic vectorization on the Cyber 205 
are also applicable on Cray computers. Additionally, the Cray compiler is capa- 
ble of vectorizing constant, nonunit stride DO loops. Directives may be inserted 
in the code, without destroying its portability, to instruct the compiler to ignore 
apparent dependencies in order to vectorize computations with complicated array 
indices. 

The following are examples of vectorization techniques and implementation 
taken directly from the three-dimensional code used for the present research. 
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The basic flow solver, MacCormack’s explicit predictor - corrector scheme, 
has as its computational kernel a set of unit stride nested DO loops, the easiest 
and most efficiently vectorized construct for the Cyber 205. Note the ordering of 
indices and loops. 


DO 200 KE — 1, 6 
DO «00 JY = 2, NM 
DO >00 KZ = 2, LM 
DO tOO IX — 2, MM 

DQ(IX,KZ,JY,KE) *■ 0.6E0 • ( -DTAU(IX,KZ,JY,1) • 

* ( (FCQ(EX,KZ,JY,KE) - FCQ(IX-1,KZ,JY,KE) ) 

* + (GCQ(EX,KZ,JY,KE) - GCQ(IX,KZ,JY-1,KE) ) 

* + (HCQ(IX,KZ,JY,KE) - HCQ(IX,KZ-1,JY,KE) ) 

* - (SSQ(EX,KZ,JY,KE) - SSQ(DC,KZ-1,JY,KE) ) ) 

* + DELQ(IX,KZ,JY ,KE) ) 

QQ(IX,KZ,JY,KE) = QQ(EX,KZ,JY,KE) + DQ(IX,KZ,JY,KE) 
tOO CONTINUE 


Since the above equations are applied only to the interior of the computa- 
tional domain, thus creating gaps in an otherwise contiguous storage arrange- 
ment (i.e., at the boundary locations), only the innermost loop vectorizes 
automatically. In order to create a computation where the vector length is on 
the order of the number of points in the domain (i.e., m n ) instead of just the 
dimension in one direction (order (m)), the WHERE block control structure pro- 
vided on the Cyber 205 is employed. A bit mask vector is used to control the 
updating of data, limiting it to the interior, while the computation was carried 
out for all points in the domain. The syntax follows. 
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DESCRIPTOR BCORD 
BIT BCORD 
(assign BCORD) 

DO S00 KE»I,( 

WHERE ( BCORD ) 

DQ(2,2,2,KE) ILC) — 0.6 • ( -DTAU(2,2,2,1» ILC) * 

* ( (FCQ(2,2,2,KEj ILC) - FCQ(1,2,2,KE } ILC) ) 

* + (GCQ(2,2,2,KEj ILC) - GCQ(2,2,1,KE» ILC) ) 

* + (HCQ(2,2,2,KE, ILC) - HCQ(2,1,2,KE| ILC) ) 

* - (SSQ(2,2,2,KE| ILC) - SSQ(2,1,2,KE| ILC) ) ) 

* + DELQ(2,2,2,KE> ILC) ) 

QQ(2,2,2,KE| ILC) « QQ(2,2,2,KEj ILC) + DQ(2,2,2,KE» ILC) 
ENDWHERE 
200 CONTINUE 


The WHERE block construct is too inefficient for the multigrid computa- 
tions, which are performed at less than 25 percent of the grid points in the 
three-dimensional case. Gather and scatter instructions are more appropriate for 
such a sparse computation. Their use is shown here. 


DO 800 KE as 1, 5 

VQQ((KE-l)*IPl+lj IP1) — Q8VCMPRS (QQ(1,1,1,KE) IMLN), 

* BCG1((IGD-2)*IMLN+H IMLN), VQQ((KE-l)*IPl+lj IP1)) 

800 CONTINUE 

(multigrid computations with VQQ) 

DO 900 KE » 1, 5 

VDQ8((KE-1)*IMLN+1) IMLN) = Q8VXPND (VDQ2((KE-l)*IPl+l, IP2), 

* BCG2((IGD-2)*IMLN+lj IMLN)j VDQ8((KE-1)*IMLN+1 } IMLN)) 

900 CONTINUE 

DO 910 KE — 1, 6 

DQ(1,1,1,KE) IMLN) = Q8VCTRL (VDQ8((KE-l)*IMLN+lj IMLN), 

* BCG9((IGD-2)*IMLN +lj IMLN)} DQ(1,1,1,KE| IMLN)) 

910 CONTINUE 


On the Cray, the prolongation step of the multigrid algorithm contains no 
barriers to vectorization, but has too many complicated variable indices for the 
compiler to decode and analyze for dependencies, so it fails to vectorize. By 
inserting a directive [192] the programmer can force the compiler to vectorize the 
code in spite of the apparent vector dependencies. 
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INTL — IGD - 1 
DO 7*0 KZ — KKB, KKE, IGS 
DO 7*0 I — UB, DE, IGS 
IIVL — IGS 
IGJ — S • IGS 
DO 720 m — 1, INTL 
nvL — nvL/a 
IGJ — IGJ/2 
JEE — N - IGJ 
DO 720 J — 1, JEE, IGJ 
CDIRI IVDEP 

DO 720 K — 1, 6 

DQ(I,KZ,J+IIVL,K) = 0.60 * (DQ(I,KZ,J,K) + DQ(I,KZ,J+IGJ,K» 
720 CONTINUE 


The parallel coarse-grid algorithm has been vectorized in a similar fashion, 
with the additional feature of combining the points to be updated on grids 2 
through N into one long vector. This minimizes the vector startup overhead and 
thus further improves the speed of the algorithm on an SIMD computer. The 
performance of the vectorized code is presented in Chapter 5 for both two- and 
three-dimensional test cases. 


4.3 Multitasking 

As stated previously, parallel-processing involves a collection of separately- 
running processes, called tasks, running on multiple CPUs. Multitasking, a term 
popularized by Cray, describes the technique by which software tools are used to 
manage multiple processors operating on a single program. Creation, synchroni- 
zation, and termination of processes, and the communication among them are the 
primary functions needed to create and execute parallel programs. These capabil- 
ities can be provided either by extensions to the Fortran language or by the 
inclusion of precompiler directives on the Cray systems used in the present 
research. 
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To properly multitask an applications code, an analysis must be performed to 
determine data dependencies, parallel structures, and shared variables present in 
the solution procedure. Data dependencies arise in situations where one process 
reads shared data that another process writes. The dependency is satisfied when 
the latter process has completed writing the data. 

A task graph (also called a data dependency graph) can provide a way of 
visualizing the task and data interdependencies. It is constructed by defining 
parallel structures or processes and representing them as a graphical objects, and 
then connecting these objects with lines depicting the data flow in the program. 
A sample task graph, describing the three-dimensional multigrid MacCormack 
scheme, is presented in Figure 4.4. Large-granularity tasks have been con- 
structed to minimize the overhead incurred by library calls used in multitasking. 
The most important task groups to note in the figure are those corresponding to 
the basic fine grid scheme and the parallel multigrid scheme, examples of spatial 
and functional decomposition, respectively, and the data dependencies. 

Extensions to programming languages which allow the creation and termina- 
tion of processes, synchronization of processes, and communication among them 
have been developed for multiprocessing on Cray supercomputers. Both the 
Cray-2 and the Cray X-MP have software libraries of multitasking tools. Two 
variations of multitasking are available on the X-MP, namely, macrotasking [193] 
and microtasking [150]. Macrotasking is intended for application to large-grained 
problem partitioning, while microtasking, by virtue of its very low overhead, may 
also be used efficiently at a fine-grained level. Microtasking has only been 
recently developed for the Cray-2. A careful examination of the problem is neces- 
sary in order to define large code structures suitable for parallel execution when 
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using macrotasking. Microtasking, on the other hand, is easily implemented in a 
code which has been optimized for vectorization as well as a code which has been 
previously macrotasked. Neither of these tools results in code which can be run 
in parallel mode on any other system, though. For this reason, other extensions 
to Fortran for use on parallel machines have been developed. The SCHEDULE 
package [168] and CoFortran [167] language extensions have been written to 
enable the execution of parallel algorithms on a variety of shared-memory MIMD 
systems without code modification. The library routine calls used by each are 
included, along with the Cray multitasking tools, in Appendix E. 

The sequential multigrid algorithm contains many opportunities for creating 
small granularity parallelism but relatively few opportunities for the sort of large 
granularity necessary to produce good multitasking speedup in the face of non- 
trivial multitasking overhead. This observation, together with the desirability of 
non-sequential multigrid schemes for reasons of algorithm flexibility, led to the 
construction of the two-dimensional parallel multigrid algorithm described in 
Chapter 3. In this algorithm, grids which are independent of one another may be 
updated simultaneously on separate processors. In fact, such a simple strategy 
may result in a poor load balance across processors because of the different 
amounts of work inherent in updating grids of different coarseness. However, 
more refined strategies are possible. Grids may, for example, be grouped 
together into tasks of approximately equal work, or they may be melded into 
tasks with other large-grained multitaskable code segments in order to equilibrate 
processor loading. Notice further that, by multitasking large-grained structures, 
the vectorization potential of code within these structures remains intact. 
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Given that MacCormack’s method and the multigrid scheme are both expli- 
cit, parallel processing may also be employed without drastic algorithmic 
modifications. In the three-dimensional case, spatial decomposition is the method 
by which parallelizable tasks are created. The computational domain, a grid 
with dimensions 129x33x33, is partitioned by planar slices in the lengthwise 
channel direction, as shown in Figure 4.5. This direction is chosen so as to 
preserve vectorization potential in the inner DO loops, which execute along lines 
parallel to the partition boundaries (the i-direction). Some overhead is intro- 
duced to enable flexible task management through variably-sized partitioning. 
Redundancy is also added to the computational scheme along the boundaries 
between tasks to eliminate the need to protect shared data from being updated 
more than once per time step in those regions. 

Detailed results of performance studies using macrotasking and microtasking 
are presented in Chapter 5. 


4.4 Performance Evaluation 

The absence of a good general means of quantifying parallel-processing super- 
computer performance is an area of concern for the computational scientist. The 
peak megaflop ratings of a system are of little worth unless one is interested 
solely in matrix multiplications (or whatever the peak-speed operation of a par- 
ticular machine may be). However, there are a few basic measurements that can 
be made to determine at least the relative improvement in execution time and 
efficiency obtained by vectorization and multitasking. For single-processor vector 
computers, speedup is the comparison between CPU time for scalar versus vector 
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modes. For computers with multiple CPUs, speedup quantifies the relative wall 
clock time for uniprocessor versus multiprocessor execution on a dedicated 
machine. A general equation for speedup is then 


S P= T 1 / T P 

where T represents time, and the subscript P refers to the number of processors. 

Given this definition, we can also find the efficiency resulting from multiple 
processing, simply by dividing the speedup by the number of processors 



The efficiency is not so clearly definable on vector processors as it is a function of 
startup time, vector length, and vector pipe length. 

If the percentage of sequential code is known, the theoretical speedup can be 
calculated from Amdahl’s law and compared to the actual speedup measured on 
the system. This will quantify the amount of overhead due to task startup and 
synchronization costs. 

Another criterion for performance evaluation is the number of floating-point 
operations per second (usually expressed in megaflops/second) executed by the 
computer on a given application. The peak Mflop rate, seldom attained by the 
user, is not a particularly useful quantity. Achieving peak speed requires keeping 
all functional units on all available processors busy without pausing for any syn- 
chronization or communication. 
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The desire to determine a more realistic (i.e., sustainable) Mflop rating leads 
to the use of benchmarks, or suites of programs which are meant to (hopefully) 
represent a typical workload on a computer. Benchmarking codes range from 
simple matrix manipulations, to compute-intensive kernels extracted from appli- 
cations codes, to full-blown numerical simulations. Widely-used benchmark 
suites include UNPACK [194], the Livermore Loops [195], and the NAS Kernels 
[196]. When benchmarking a parallel computer, these programs must often be 
recoded to include parallel Fortran extensions or rewritten in an entirely different 
language in order to be able to measure the multiprocessing system performance. 

The study of parallel-processing efficiency raises a peripheral issue which is 
sometimes used as an argument against applying multiple processors to a single 
application. Multitasking reduces system throughput by introducing additional 
overhead and synchronization costs (since, in general, parallelization results in 
less than 100% efficiency) which are not incurred by a uniprocessor run. How- 
ever, in order to advance the frontiers of computational science and to solve 
problems that are not tractable on single-CPU systems because of prohibitively 
long execution times, parallel-processing must be exercised [197]. A study by 
Bieterman [198] has actually shown that throughput can even be improved by 
multitasking on a system that is constrained by memory size. 


4.5 Current-Generation Supercomputers 

The following are classified as current-generation supercomputers: Cyber 

205, Cray X-MP, Cray-2, Fujitsu VP- 100/200/400, Hitachi S810/820, and NEC 
SX2/A2. A supercomputer is defined as one of the fastest computers presently 
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available, a floating definition. The characteristics of the Japanese supercomput- 
ers [199] (the Fujitsu, Hitachi, and NEC machines) are summarized in Table 4.1, 
and are included in this section for completeness, although no calculations were 
performed on any of these systems. All are vector SIMD computers. 

The three machines to be discussed in detail in this section, the Cyber 205, 
the Cray X-MP, and Cray-2, have all been used extensively in carrying out the 
current research. Their important features are recorded in Table 4.2. 

4.5.1 Cyber 205 

The Control Data Cyber 205 [200] is a single CPU vector-processing super- 
computer. It was first commercially marketed in 1981, and can be configured 
with one, two, or four vector pipes and one to 32 million 64-bit words of main 
memory. The 205 has a CPU clock cycle time of 20 nanoseconds and memory 
access cycle time of 80 nanoseconds. Vector operations are performed by fetching 
data directly from memory and returning the result back to memory, without the 
use of any intermediate registers. These operations require that vector elements 
be stored in contiguous locations. The virtual memory of the machine allows 
paging of blocks of memory to and from a secondary storage level, which allows 
programs that use more than the available main memory of the machine to be 
executed without explicit user intervention. 

In order to be able to fully exploit the vector-processing capability of the 
Cyber 205, CDC has provided enhancements to the Fortran language in the form 
of explicit vector syntax [191]. Also furnished are language extensions that permit 
the efficient management of the paging that arises in the use of virtual memory. 
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Some of the advantages of the 205 relative to other supercomputers include 
its bit manipulation capabilities, use of virtual memory, and good performance 
for applications using long vector computations. However, the 205 is relatively 
inefficient for short vector operations, and the need to manage virtual memory to 
prevent thrashing can become a nontrivial task. Additionally, creating long vec- 
tors with elements in contiguous memory locations generally requires the use of 
nonstandard Fortran syntax which adds significant effort to the development of a 
code. Results of vectorized two- and three-dimensional versions of the multigrid 
algorithm, presented in Chapter 5, illustrate the relative merits of relying on the 
compiler versus investing a great deal of individual effort into enhancing perfor- 
mance. 


4.5.2 Cray X-MP 

The Cray X-MP [201] can be configured with one, two, or four vector- 
processing CPUs and between two and sixteen million 64-bit words of shared 
main memory. The four-processor X-MP/4 can have four to sixteen Mwords of 
ECL (emitter-coupled logic) bipolar memory, with an access cycle time of 34 
nanoseconds, while the main memories of the X-MP/1 and 2 systems use MOS 
(metal oxide semiconductor) technology, with an access cycle time of 68 
nanoseconds. The X-MP/1 and 2 were introduced in 1982 with a CPU clock cycle 
time of 9.5 nanoseconds, and were followed in 1984 by a four-processor system. 
An 8.5 nanosecond clock machine has also been developed, and lately an 
extended- architecture (EA) version of the X-MP, which allows the use of a larger 
main memory, has been introduced. 
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To vectorize computations, the Cray moves data from main memory into 
registers before performing the calculations, in comparison with the direct 
memory-to-memory mode for vector operations on the Cyber 205. Each X-MP 
CPU has four parallel memory ports: two for vector fetches, one for vector 
stores, and one for independent I/O operations [201] to provide high bandwidth 
between memory and processors. A cluster of registers in the system is dedicated 
to the sole purpose of efficient interprocessor communication and control. 

Vectorization of constant-stride DO loops is done automatically by the com- 
piler, a process which can be either aided or inhibited by user intervention in the 
form of compiler directives. Parallel-processing is achieved through macrotasking 
or microtasking as is discussed in section 4.3. In the former, extensions to the 
Fortran language in the form of multitasking library calls are used, while the 
latter is accomplished through the use of preprocessor directives. 

The X-MP has the following advantages over other supercomputers. Its 
optimal vector length is 64, so that relatively short vectors perform just as well 
as long ones. It has up to four processors which gives it potentially four times 
the single-CPU performance. Microtasking is provided as a low overhead, easily- 
implemented tool for parallel processing. The requirement of contiguous storage 
of vector values is overcome by the use of vector registers. On the other hand, 
main memory size is not exceedable, as there is no virtual memory, which res- 
tricts problem size. The use of an SSD (Solid-State Storage Device) somewhat 
alleviates this restriction, although its use is not transparent to the user. The 
macrotasking version of multitasking has relatively high overhead and requires 
fairly extensive code restructuring for its efficient use. 
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4.5.3 Cray-2 

The Cray-2 [202], introduced in 1985, has the largest shared main memory of 
any supercomputer (268 million 64-bit words) and the fastest CPU clock speed 
(4.1 nanoseconds). It is a four-processor system, cooled by immersion in an inert 
liquid. Each processor has a peak speed of 488 Mflops, with the total system 
rated at 1.9 Gflops. The processors share one memory port, and main storage is 
configured in 128 banks. 

The large primary memory size of the Cray-2 is its main advantage over 
other current-generation supercomputers. It also possesses a small cache memory 
which has much faster access than main memory. The UniCos operating system 
is also considered by many to be a favorable characteristic. Although it possesses 
very fast CPUs, instructions are issued only once every two clock cycles, and for 
most applications its computational speed is equivalent to that of the Cray X-MP 
on a per-CPU basis. Multitasking is currently less efficient as implemented on 
the Cray-2 than on the X-MP due to the different hardware designs of the 
machines, the most notable of which is the single path from CPUs to main 
memory on the Cray-2. The large memory has a fairly slow access cycle time, on 
the order of 80-120 nanoseconds, which generally leads to a higher incidence of 
memory bank conflicts in many computer codes, impacting both single- and mul- 
tiprocessing execution rates. 
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4.6 Next-Generation Supercomputers 

The important (relevant) characteristics of the supercomputers which 
comprise the "next generation,* that is, those having a significant performance 
increase over currently-available machines, are summarized in Table 4.3 [203]. 

Cray has recently delivered its first Y-MP [204], a follow-on to the X-MP 
line. The Y-MP initially has a maximum of eight processors with 6 nanosecond 
clock cycles, with one instruction issue per clock cycle. Peak speed is expected to 
be 2.5 gigaflops, with capability of one gigaflop sustained. It will have 32 million 
64-bit words of main memory with 30 nanosecond access time, plus up to 512 
megawords of auxiliary memory in the form of an SSD. Main memory size may 
increase in later models as higher density chips become available. In preliminary 
benchmark tests on the Y-MP, the NAS kernels [196] have attained twice the 
speed of a single CPU of the X-MP. 

The Cray-3, successor to the Cray-2, will have 16 processors, each of which 
will be two to three times faster than a single Cray-2 processor [205]. This high 
performance will be partially attributable to gallium arsenide technology used in 
the processor chips to achieve a 2 nanosecond CPU clock cycle time. Peak speed 
is purported to reach 16 gigaflops. The Cray-3 will have 512 million 64-bit words 
of fast static RAM as main memory. Initial deliveries are expected as early as 
1990. 

The ETA-10, successor to the CDC Cyber 205, is available with as many as 
eight processors, each up to 2.5 times as fast as the 205 [206], Several systems 
have been delivered. The fastest eight-processor system is purported to have a 
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CPU clock cycle of 7 nanoseconds, with liquid-nitrogen cooling, and with four 
megawords of memory local to each processor and a 256-megaword shared secon- 
dary memory. Peak speed of the full-blown system in 64-bit mode may reach 
four gigaflops. 

The Japanese supercomputer vendors also have plans for next-generation 
supercomputers. Hitachi will have a single-processor two gigafiop system and 
NEC is planning a multiprocessor to attain peak speeds in excess of twenty 
gigaflops for 1989 [203]. 


4.7 Advanced-Architecture Machines 

While supercomputer designers are cautiously moving toward more highly- 
parallel systems, many of the superminicomputer vendors are already supplying 
machines with massive parallelism. Although individual processors of these sys- 
tems are not nearly as powerful as, for instance, a single CPU of the Cray-2, the 
fact that they can be focused in large numbers on a single computation has great 
potential. Their usefulness in large-scale scientific computing depends on the suc- 
cess with which applications and algorithms can be efficiently partitioned and 
parallelized. 

Three types of advanced-architecture minisupercomputers are briefly 
described here. The first example is a distributed-memory, SIMD system. The 
Connection Machine 2 [207] consists of a massive array of bit serial processors 
(up to 2 1# ) connected in a hypercube topology. Each processor has its own 
memory (on the order of .°»2Kbits). Instructions are issued from a front end 
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computer such that all processors operate in lockstep. This computer has the 
potential for speeds on the order of several gigaflops, and promising results for a 
CFD application have already been obtained [163]. 

A second class of architecture is the shared-memory, MIMD system. The 
Alliant FX/8 falls under this category [208]. Although the X-MP and Cray-2 are 
also of this type, the less-powerful minisupercomputer is distinguished by its 
more sophisticated software environment and lower cost. The Alliant’s Fortran 
compiler automatically multitasks code in much the same way as the Crays vec- 
torize, with minimal interaction from the programmer taking the form of com- 
piler directives. Its performance on selected loops extracted from CFD applica- 
tions ranges from 15 to 23 Mflops on an eight-processor system [161]. 

The majority of distributed-memory MIMD computers (the third example of 
a parallel computer architecture) can be categorized as "hypercubes." The 
hypercube class of multiprocessors originated with the Cosmic Cube project at 
Caltech [209]. Current commercial systems include the Intel iPSC/2 [210], the 
Ametek 2010 [211], and the NCube [212]. Bassett and Catherasoo [154] have 
demonstrated reasonable performance of the Ametek system on one of Jameson’s 
"FLO" codes [177], and a CFD package has developed for the iPSC [210]. 

The most notable work to date for a massively-parallel computer, however, 
has been carried out by Gustafson, Montry, and Benner [152] on the Ncube. 
Using a 1024-processor configuration, they were able to attain measured speedups 
greater than 400 with three applications codes, including a flux-corrected tran- 
sport CFD algorithm. 
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Also proposed in Ref. 152 is an alternative performance measure for parallel, 
distributed-memory computers. Determining the actual speedup is restricted by 
the limited memory available on any one processor of a massively parallel system. 
That is, the largest problem for which performance can actually be measured (by 
comparing T and Tp) is determined by the size of problem that fits on a single 
node. Gustafson et al. claim that as more processors are applied to a computa- 
tion, problem size should grow correspondingly (i.e., the computational scientist 
will want to use all available resources). They postulate that the serial portions 
of an algorithm remain fixed in size under such a scenario, and hence the effect of 
residual sequential code on speedup is not as strong as Amdahl’s law would 
predict. 

As more and more computational fluid dynamics applications are restruc- 
tured and implemented on advanced-architecture superminicomputers, a variety 
of the above-mentioned classes of parallel-processors may emerge as the most 
appropriate for certain classes of simulations. Investigations in this area can help 
determine the most promising directions for future supercomputer design. The 
many factors contributing to the performance of these machines must be care- 
fully considered and weighed against one another in order to provide the max- 
imum possible computing resource to the scientist. In the following chapter, the 
results of the current investigation implementing a complex CFD algorithm on 
shared-memory S1MD and MIMD systems are presented. 


5. Computational Results 


Given the solution procedure and parallel-processing considerations 
described herein, a set of model problems has been generated to test the 
efficiency and robustness of the algorithm. They are representative of internal 
flow problems, such as those which arise in turbomachinery blade rows and 
wind tunnels. 


5.1 Problem Specification 

For two-dimensional computational experiments, the basic geometry consists 
of a cascade of bicircular arc blades at zero angle of attack and no stagger. The 
three-dimensional model contains a bicircular arc blade mounted between 
endwalls at variable sweep angle. These problems allow the investigation of 
accuracy, stability, and performance of the algorithm without presenting 
unnecessarily complex grid generation requirements. 

These particular geometries have been selected because internal flow prob- 
lems often present more of a challenge to the computational fluid dynamicist 
than external flow problems. This is evidenced by the lag in the development 
of internal flow codes compared to those for external applications and is due in 
part to the more restrictive geometries and boundary conditions. The tran- 
sients in an external flow problem are typically allowed to radiate to infinity in 
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all or almost all directions, while the transients in an internal flow problem are 
constrained to radiate out small inlets and exits. Gridding is also more difficult 
for internal flows because the stretching is more complex, due to the need to 
resolve flow near a number of surfaces and also to have good resolution in the 
interior of the domain. 

The two-dimensional computational domains for both viscous and inviscid 
problems are illustrated in Figure 5.1. A bicircular arc blade is located one 
chord length from both the inlet and the exit. Channel height is also equal to 
one chord length. Symmetry permits limiting the computational domain to 
that shown by the dashed lines in the figure. 

Physical boundary conditions for two-dimensional inviscid flow cases are 
also included in Figure 5.1. At the inlet, total temperature, total pressure, and 
flow direction are specified. The flow tangency condition is enforced at the 
upper and lower boundaries. At the exit, static pressure is fixed. 

For two-dimensional viscous flow computations, the geometry of the 
domain is the same as that used for the inviscid cases. The problem of flow 
past a sting-mounted bicircular-arc airfoil is constructed by imposing the 
appropriate viscous boundary conditions. Inlet, exit, and upper wall boundaries 
have the same conditions as the inviscid case. Along the lower boundary, how- 
ever, the airfoil and sting are treated as viscous walls at which the no-slip con- 
dition and temperature or temperature gradient are specified. 

The computational domains used for the three-dimensional model problem 
are presented in Figure 5.2. As in 2-D, the domain is intended to represent the 
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geometry of a turbomachinery blade row or cascade. The sweep angle of the 
blade is permitted to vary from 0 to 26 degrees relative to the normal of the 
flow direction. The blade, at 0 degrees sweep (i.e., perpendicular to the incom- 
ing flow direction), is positioned one chord downstream from the inlet with its 
trailing edge one and a half chords upstream of the exit. This positioning 
ensures that, as the sweep angle is increased to 26 degrees, the blade will remain 
at least one chord upstream of the exit. The channel height and width are each 
equal to one chord length. 

Boundary conditions for the three-dimensional test problems are also shown 
in Figure 5.2. For inviscid flow, the inlet total temperature, total pressure, and 
the angle of the incoming flow are fixed. At the exit, static pressure is fixed. 
Along the front, back, top, and bottom walls, the tangency condition is 
imposed. 

For the viscous problem, the inlet and exit conditions remain the same, 
while the front and back boundaries are treated as viscous walls (representing 
the endwalls to which the blade is attached in the cascade), and, as such, the 
no-slip condition and temperature or temperature gradient are specified there. 
Along the upper boundary flow tangency is imposed. Along the blade surface 
and downstream along the lower boundary to the exit, the viscous wall condi- 
tions of no-slip and the temperature or temperature gradient are specified. 

3 

The Reynolds number of the flows computed span the range from 8.4x10 to 

2.0xl0 3 * S based on cascade gap and critical speed. Airfoil thickness is varied 
from 10% of the channel height to 0% (which represents a flat plate). The 
angle of attack in all cases is zero. 
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5.2 Grid Definition 

A fairly coarse global mesh is used as the base grid, and all grid regions are 
supersets of it. It is generated algebraically. The two-dimensional mesh con- 
tains lateral grid lines which are smoothly stretched away from the lower wall 
boundary in a geometric progression. Transverse grid lines are equispaced over 
the blade and then stretched away from the leading and trailing edges to the 
inlet and exit, respectively. Typical 2-D grids are illustrated in Figure 5.3. 
Although the figure shows either 65x17 or 129x33 points on the grids, computa- 
tions have also been performed on grids dimensioned with 65x33 points. Initial 
spacing in the normal direction near walls is varied as appropriate for viscous 
and inviscid wall boundary conditions and flow cases. 

The three-dimensional grids are constructed by simply stacking the two- 
dimensional grid planes. These planes are clustered near the front and back 
walls of the computational domain. Figure 5.4 shows the inlet, lower- and 
rear-wall grids for typical three-dimensional inviscid (airfoil) and viscous (flat 
plate) problems. Calculations have been carried out on meshes dimensioned 
65x9x17 (the largest that would fit in main memory on the Cyber 205) and 
65x17x17 and 129x33x33 (on the Cray X-MP and Cray-2). The latter two are 
displayed in Figure 5.4. Using this notation, the first dimension represents the 
number of points in the lengthwise or x-direction, the second, the transverse or 
^-direction, and the third, the vertical or y-direction. 
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5.3 Algorithm Performance 

Numerous computational experiments have been carried out for both two- 
and three-dimensional test cases to measure the robustness and the speed of the 
various components of the algorithm presented here. The data obtained enable 
an examination of accuracy, stability, convergence behavior, and vectorization 
and multitasking efficiency. In addition to testing the performance of each ele- 
ment of the solution procedure individually, the fully-integrated multitasked, 
vectorized, embedded zonal multigrid three-dimensional scheme has been tested 
on the Cray-2. Results are presented here in the form of flow visualizations, 
discussions of accuracy and stability, convergence comparisons, and tabulated 
speedup and efficiency measurements. 

5.3.1 Flow Results 

Since the primary goal of this research is to improve the speed at which a 
given three-dimensional flow solution can be computed, one measure of success 
is that all flow results should be independent of the variant of algorithm being 
applied. Any discrepancies must be resolved. With this in mind, typical results 
for two- and three-dimensional cases are presented as a baseline. 

Figures 5.5 and 5.6 show isomach contours on a 129x33 grid for the follow- 
ing two-dimensional flow cases: inviscid subcritical flow over a 10% ick circu- 
lar arc blade at freestream Mach number 0.5, and inviscid shocked supercritical 
flow over the same blade with a freestream Mach number of 0.675. Figures 5.7 
and 5.8 show velocity profiles from a 65x33 grid for viscous separated laminar 
flow over a 5% thick circular arc blade and viscous separated turbulent flow 
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over the same blade. We note here that these converged steady-state flow 
results are identical (to the precision of the computer being used) for both unac- 
celerated MacCormack and MacCormack with multigrid solution procedures. 
Likewise, the use of vectorization and multitasking do not alter the flow results 
in any of these cases. 

Figures 5.9 and 5.10 depict corresponding inviscid results obtained with the 
two-dimensional embedded grid scheme. These cases are computed with two 
levels of grid refinement and with only the Euler equations solved in all regions. 
The global coarse grid is dimensioned 33x9; the refined region near the airfoil 
has spacing identical to a 129x33 grid. Comparison of the baseline results (Fig- 
ures 5.5 and 5.6) with embedded grid results shows that using simple grid 
embeddings with interpolations at the intergrid boundaries yields qualitatively 
acceptable flow solutions. The oscillations near the shock (due to lack of a good 
damping term for the very fine grid) are even duplicated. 

Three-dimensional results have been obtained for inviscid subcritical and 
supercritical and viscous laminar flows over swept and unswept blades. Figure 
5.11 shows surface Mach number distributions along selected planes for the 
65x17x17 subcritical and supercritical inviscid flow case with the blade at a 
sweep angle of 26 degrees. Figure 5.12 contains surface isomach plots compar- 
ing the inviscid supercritical flow over unswept and swept 10% thick blades cal- 
culated on a 65x17x17 mesh. Figure 5.13 presents velocity profiles at 50% span 
for flow at Reynolds number 3.4xl0 4 and sweep angle of 26 degrees. Again, 
results are identical for MacCormack and multigridded MacCormack. 
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5.3.2 Accuracy and Stability 

The MacCormack method yields a solution that is second order accurate in 
both space and time [62]. For very fine three-dimensional grids, artificial viscos- 
ity (in the form of a fourth-order term described in Chapter 3) is added to the 
scheme to maintain stability. The effect of this additional damping is to enable 
convergence while maintaining the capability for high resolution of the interest- 
ing features of the flow. 

Multigrid, when applied to the explicit predictor-corrector scheme, or to any 
other Lax-Wendroff scheme, for that matter, has been shown to yield the same 
solution to within the accuracy of the fine-grid scheme [76]. All results obtained 
here substantiate this assertion. 

The embedded grid scheme, however, can affect the flow results in certain 
cases if care is not taken. Proper placing of the grid refinements is crucial to 
obtaining a correct flow solution, especially when different equation levels are 
being modeled in adjacent regions. Treatment of the intergrid boundaries in a 
non-conservative manner also impacts accuracy. Obviously, coarser grids have 
larger truncation error and can only be used in regions of the flow where the 
physical quantities are well-behaved or smooth. 

The stability of the scheme is dependent on the treatment of the domain 
boundary conditions, the intergrid boundary conditions, and the amount of 
artificial viscosity introduced to stabilize the fine-grid solver. In all computa- 
tions the time-step method was implemented using local time stepping and was 
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run at 90% of the prescribed CFL limit. Tests in which the CFL limit was 
intentionally exceeded resulted in divergence of the solution. This implies that 
the observed region of stability of the scheme agrees with the theoretical predic- 
tion. 

S.3.3 Convergence Behavior 

Several components of the algorithm serve to accelerate the convergence to 
a steady-state solution. Multigrid uses a series of coarser grids underlying a glo- 
bal fine grid to enhance performance of a basic fine-grid scheme, while the 
embedded zonal grid scheme reduces computational cost by the elimination of 
unnecessary grid points and some of the viscous term calculations. 

Multigrid, applied to Lax-Wendroff-type methods for viscous and inviscid 
flow problems, has been tested extensively in two dimensions by Johnson 
[75,76,85]. In Table 5.1, a short summary of Johnson’s results for the baseline 
flow problems used in the present work is recorded. Speedup for these cases is 
measured as the work to obtain a converged solution for the fine-grid scheme 
compared to that for the multigridded scheme. In Ref. 75, the term "work 
reduction factor" is used analogously. A more complete presentation of the 
two-dimensional multigrid results may be found in any of the aforementioned 
references. 

The two-dimensional parallel coarse-grid algorithm maintains essentially the 
same convergence behavior as the sequential algorithm. Consequently, the 
multiple-grid speedups obtained with it are nearly identical to those of the 
sequential algorithm. 



88 


The multigrid method has been extended to three dimensions [87]. Used in 
conjunction with the explicit MacCormack method, speedup (as defined above) 
of nearly five times has been demonstrated for the subcritical test cases. The 
inviscid supercritical flow problem has a higher fine-grid convergence rate and 
therefore shows less improvement due to multigrid. These results are tabulated 
in Table 5.2. The basic fine grid for these test cases has 65x9x17 points, and 
three grid levels are used. The viscous test case yields a speedup factor of 4.4. 

Tables 5.3 and 5.4 contain the results of the embedded zonal multigrid algo- 
rithm performance studies. The solution procedure was first implemented in 
two dimensions [108]. Numerical experiments were performed to determine the 
optimal positioning of the grid refinement levels. The inviscid cases do not take 
advantage of the zonal-equation part of the solver, since the Euler equations are 
solved everywhere in the domain. They therefore represent a measure of the 
work reduction achieved be merely deleting grid points in areas of the flow field 
where gradients are expected to be small. Speedup for these cases is defined as 
the work to a converged solution on a globally fine grid (equivalent to the finest 
embedding level) compared to the work to convergence using the embedded 
multigrid scheme. Two levels of grid refinements were used (see Figure 3.5) and 
resulting speedups are presented in Table 5.3. For the viscous case the thin- 
layer Navier-Stokes equations were solved on the finest embedding and the 
Euler equations defined the flow elsewhere. The speedup of 30 is a measure of 
the combined effect of grid point elimination and simplification of the governing 
equations via the zonal solver. These very encouraging results prompted further 
investigation into the viability of the scheme for three-dimensional problems. 



The three-dimensional implementation of the scheme has demonstrated the 
performance characteristics contained in Table 5.4. The test grid has one 
refinement level corresponding to a global fine grid of 65x17x17 points (see Fig- 
ure 3.6). For these in viscid cases, although just the Euler equations are solved 
everywhere, the artificial viscosity that has been added to the three-dimensional 
code is only computed and added to the solution at the finest mesh level. Thus 
the 3-D inviscid cases no longer provide a measure of the improvement in per- 
formance due solely to grid refinement alone, since the damping computation 
has been eliminated from use on the coarser grids. In the viscous case, the full 
Navier-Stokes equations are solved in the refined mesh region and the inviscid 
equations are applied elsewhere. The performance improvements demonstrated 
by these simple test cases with only one embedding lead to the speculation that 
even better results are possible for more complex problems. 

Results have been obtained using the embedded multigrid for a case with 
the finest embedding equivalent to a global grid dimensioned 129x33x33, but, 
due to the difficulty of computing the corresponding baseline case (in terms of 
amount of CPU time required), no quantitative comparison can be made. How- 
ever, the ability of the embedded multigrid scheme to obtain such a solution in 
a more reasonable amount of time attests to its potential. 


5.4 Multitasking and Vectorixation Performance 

In addition to the investigation into various algorithmic components 
detailed above, a second, equally important study has been conducted to exam- 
ine and improve the performance of the scheme on advanced-architecture 
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supercomputers, in order to further minimize the 'time to solution" for large- 
scale flow problems. Both vectorization and multitasking on Cray and CDC 
supercomputers have been employed in this effort. 

5.4.1 Vectorization 

Vectorization of the two-dimensional parallel coarse-grid algorithm on a 
Cyber 205 yields speedups of 2.7 to 3.0 for the test cases considered. The vec- 
torization results are summarized in Table 5.5. The overall speedups for the 
explicitly vectorized parallel coarse-grid scheme are recorded and contrasted 
with both the scalar and explicitly vectorized versions of the sequential coarse- 
grid scheme. Although the explicitly vectorized sequential scheme is highly 
efficient, a noticeable performance improvement results from use of the parallel 
scheme. The table also compares the performance of the parallel coarse-grid 
code segment with the analogous segment in the sequential version. By this 
measure, the performance improvement over the explicitly vectorized sequential 
code is about 8 %. One should note that a performance gain of this size is 
significant since the potential for further vectorization of the code is quite low. 

The three-dimensional multigrid MacCormack code was optimized to take 
full advantage of the automatic vectorization performed by the Cyber 205 com- 
piler. Speedups of 4.2, 3.1, and 2.6 over the scalar code were obtained for 
selected inviscid subcritical, inviscid supercritical and turbulent viscous flows, 
respectively (see Table 5.6). 

Further vector speedup was obtained by implementing Cyber 205 explicit 
vector Fortran. The speedups produced by this hand-vectorized code are 
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contained in Table 5.6, and show an average improvement of 45% over the 
automatically-vectorized code. 

5.4.2 Multitasking 

Extensive investigation into parallel-processing and multitasking has been 
carried out. Various problem decomposition strategies and multitasking utili- 
ties have been implemented, and results of these investigations are presented 
here. 

Performance of parallel computers may be evaluated by comparing wall 
clock time for both uniprocessed and multiprocessed runs on a dedicated 
machine. This ratio is called the speedup. Dividing the speedup by the number 
of processors gives the efficiency of processor utilization, which provides a meas- 
ure of load balancing and overhead costs. 

A two-dimensional version of the code makes extensive use of parallelism 
inherent in the physical problem to obtain a good load balance when using the 
Cray-2 and X-MP macrotasking software. As the macrotasking approach 
requires the use of calls to a subroutine library, it has rather high overhead and 
thus yields best results for large-grained code segments. The problem has been 
decomposed both functionally and spatially. For the fine-grid scheme, the com- 
putational domain is partitioned into p strips (where p is the number of proces- 
sors available), in such a way as to preserve vectorizability of the innermost DO 
loops. The parallel coarse-grid implementation of the multigrid scheme yields 
large-grained tasks, consisting of entire grid levels, to facilitate multitasking. 
Table 5.7 shows the resulting speedups of the coarse grid computations alone, in 
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the range of 77% to 89% efficiency. Table 5.8 presents the fine-grid macrotask- 
ing results and contrasts them with the same decomposition strategy imple- 

V 

mented using microtasking. The microtasking results are only marginally 
better than the macrotasking ones because the algorithm has been restructured 
to maximize task granularity, with both getting good performance at over 90% 
efficiency. 

Table 5.9 presents the two-dimensional macrotasking results from both the 
Cray X-MP/48 and two Cray- 2s (with differing memory-access speeds) for the 
basic solver with multigrid. X-MP performance shows that the algorithm has 
been efficiently parallelized, while the poorer performance on the Cray-2s can be 
attributed to several factors. Macrotasking software is less mature on the 
Cray-2. By examining the difference in performance between the earlier Cray-2 
(with 120 nanosecond memory-access cycle time) and the newer model (with 80 
nanosecond access time), some of the decreased efficiency appears to be attribut- 
able to memory bank conflicts. The single path to memory may also impact 
multiprocessor efficiency. This is further discussed below. 

Three-dimensional microtasking results for a small-grained partitioning of 
the multigrid algorithm, due to Pryor [213], are shown in Table 5.10. 

The three-dimensional embedded zonal grid algorithm has been multitasked 
and tested on two Cray-2s with different memory-access speeds. Macrotasking 
has been implemented in the form of TSKSTARTs and TSKWAITs in the 
code, partitioned as described above. All performance data were obtained dur- 
ing dedicated time on both machines, and all runs were repeated at least three 
times to check the consistency of the timing data. To distinguish between the 
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two Cray-2 computers, they will be referred to here as Navier (serial number 
2013), with a faster (80 nanosecond) memory-access time, and Stokes (serial 
number 2002), with the slower (120 nanosecond) memory access. 

Table 5.11 shows the speedups and efficiency of processor utilization for 
two-, three-, and four-processor 100-time-cycle test cases of the three- 
dimensional code on both Navier and Stokes. A speedup of 2.67 on four proces- 
sors, corresponding to 67% efficiency, is attained on Navier. On two and three 
processors, efficiencies of 86% and 74% were achieved, respectively. Comparing 
these results with the results obtained on the slower-memory Stokes indicates 
that some of the multitasking inefficiencies may be due to bank conflicts, that 
is, contention between processors for access to the 256-Megaword shared 
memory. A 2.5% to 8% increase in efficiency is observed when using the faster 
memory Cray-2. However, these improvements do not reflect the 15-20% 
improvement expected from merely increasing the memory-access speed. This 
implies that the other factors, possibly including the single path to memory, 
also serve to deteriorate multitasking performance. 

Table 5.12 contains information about the best and worst performance of 
individual task groups, which were created by the algorithm partitioning. Each 
set of tasks may be classified (loosely) as one of the following: 1) loading one 
array into another (no operations); 2) simple computation (add and/or multi- 
ply) at all grid points; 3) many computations with unit stride at all points; 4) 
many computations with nonunit stride; and 5) heterogeneous tasks. Both the 
best- and worst-performing tasks in this study fall under category 4, indicating 
that task classification, at least by this scheme and for a complicated CFD 
problem, is not necessarily ? good predictor of multitasking performance. If the 
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tasks are ranked according to unitasked wall clock execution time (a measure of 
granularity, on a dedicated machine), the better speedups generally correlate 
with the longer task segments, an indicator that task overhead (or the ratio of 
overhead to granularity) is slowing down the computation. The tasks remain 
too complicated to sufficiently detect the presence and extent of bank conflicts, 
so that any conclusions about the effect of memory contention must be drawn 
from the above comparison of performance on the two Cray-2s. Additional 
insight might be gained from experiments consisting of identical tasks with 
different array sizes, which would vary the memory access patterns. 


6. Conclusions 


The objective of this research has been to develop a robust and efficient com- 
putational algorithm for the solution of two- and three-dimensional aerodynamic 
flows which are governed by the Reynolds-averaged Navier-Stokes equations. A 
major consideration in the development of this methodology has been to try to 
fully exploit the vector- and parallel-processing capabilities of current-generation 
supercomputers. 

This goal has been achieved with the following solution procedure. With an 
explicit flow solver as the basic ingredient, embedded refinements to the grid are 
introduced in regions of the flow field requiring high resolution (or, in other 
words, grid points are deleted in areas where high resolution is not needed). The 
basic scheme is then applied in a zonal implementation, whereby only the 
appropriate equations in the hierarchy from Euler to full Navier-Stokes are 
numerically solved depending on the physical characteristics of the flow, as well 
as the grid resolution, in various regions. Multigrid is applied to this embedded 
zonal method to accelerate convergence when steady-state solutions are desired. 
Vectorization and multitasking are used extensively to obtain the best perfor- 
mance on the most powerful computers available today. 

Each component has been tested in two and three dimensions, and then 
integrated into the full three-dimensional scheme. Performance statistics for the 
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various pieces of the algorithm have been gathered in an attempt to predict the 
capabilities of the complete scheme. 

A set of test cases representative of the internal flow through a tur- 
bomachinery blade row have been compiled. Results have been presented for a 
set of two- and three-dimensional model problems, both inviscid and viscous, 
subcritical and supercritical, and laminar and turbulent flows. 

In two dimensions, multigrid alone yields speedups ranging from 2.1 to 8.2 
over the fine-grid MacCormack scheme (as measured in work to convergence). 
For the three-dimensional problem, multigrid accelerates the computation by a 
factor of between 2.5 to 4.7. 

The embedded zonal multigrid solver in two dimensions shows improvement 
over the fine-grid MacCormack scheme by factors of 6.1 to 30.2 with two levels 
of grid embeddings. In three dimensions, with only one level of grid embedding, 
speedups as much as 16.0 have been obtained. 

Vectorization of the multigrid code in three dimensions on the Cyber 205 
resulted in performance increases by a factor of nearly six over the scalar code. 
Introduction of the parallel coarse grid scheme gained an additional 8% in com- 
putational speed. 

Much investigation has been carried out concerning the efficient use of paral- 
lel computers, and this topic has evolved into a focal point for the present 
research. Individual components and then the full scheme have been partitioned 
and multitasked. Using the Cray X-MP, two versions of multitasking were 
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implemented: macrotasking, a library-based utility which works best with large- 
grained tasks, and microtasking, a loop-level parallelization mechanism that can 
be efficiently used to multitask fine-grained code segments. Macrotasking was 
also employed on two Cray-2s having different memory access speeds. 

The efficiency of processor use through macrotasking the two-dimensional 
multigrid code ranged from 83% to 94% on four and two processors, respectively, 
on the X-MP/48, and 56% to 84% on the Cray-2 for the same cases. Microtask- 
ing on the X-MP improved four-processor performance to 95% efficiency on the 
basic MacCormack fine-grid scheme, and the faster-memory Cray-2 improved 
performance to 90% efficiency for the 2-D multigrid code. As another data 
point, the three-dimensional multigrid code was microtasked on the X-MP and 
yielded speedups of 1.96 and 3.55 on two and four processors, respectively. 

The fully integrated embedded zonal multigrid algorithm has been macro- 
tasked on the Cray-2. The best multitasking performance speedups attained 
(over the unitasked analog) are 1.7, 2.2, and 2.7 on two, three, and four proces- 
sors. A study of the performance of isolated components of the three- 
dimensional Navier-Stokes computation has also been carried out. 

The three-dimensional multitasked embedded zonal multigrid code produces 
flow solutions on parallel-processing supercomputers that, if computed with a 
unitasked MacCormack fine-grid flow solver, would take an unreasonable amount 
of CPU time. Hence, it presents a means of investigating flow fields which are 
currently intractable. The viability of the solution algorithm, detailed herein, 
has been demonstrated by flow results and performance measurements. 



7. Future Research 


The algorithm and parallel-processing research carried out in this work has 
resulted in almost as many questions being raised as answered. In this chapter, 
some of the outstanding issues are noted and categorized, and the direction of 
possible future work is discussed. 


7.0 Refinements 

The treatment of the intergrid boundary conditions in the embedded grid 
scheme by simple bilinear interpolation, while sufficient to demonstrate the 
algorithm’s viability, deserves further study. Various approaches applied by oth- 
ers in similar methods, including Usab [116], Rai [185], and Holst et al. [18], 
merit thorough examination in both two- and three-dimensional cases. As 
alluded to in Chapter 3, a conservative intergrid boundary scheme would 
presumably produce better results while increasing the computational work. 
Minimizing the cost of a conservative interface treatment would be a serious con- 
cern in such a research effort. 

7.1 Extensions 

A generalization of the grid embedding scheme to allow arbitrary numbers of 
grid levels and embedding regions should be implemented. The present 
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algorithm poses no barriers to such a generalization; an improved data manage- 
ment scheme (such as Usab’s pointer system) might also enhance efficiency, in 
addition to providing the obvious benefit of making the scheme more flexible. 

The generation of grid refinements could then be automated, as determined 
by the physics of the flow. The method of implementation would be based on 
previous work, such as Thompson and Ferziger [112], Berger [188], or Kallinderis 
and Baron [120,121], where either the magnitude of the pressure gradients or the 
solution residual is used to as a feature detection scheme to define refinement 
regions, which must be constrained to being regular. Berger’s work in this area 
is of particular interest because it also incorporates adaptive refinements, yet 
another possible enhancement to the present scheme. This would enable the grid 
refinements to adapt to the physically evolving flow structures and automatically 
redefine themselves wherever or whenever necessary. Such a capability is partic- 
ularly attractive for application to time-dependent flows. 

7.2 Algorithms 

As mentioned previously, all components of the solution procedure are expli- 
cit except for the surface pressure boundary condition, which consists of solving a 
normal momentum equation implicitly along x gridlines. The solution of this 
partial differential equation could also be computed with an explicit method, so 
as to remove the recursive computation so that the boundary condition would be 
vectorizable and parallelizable. Preliminary testing in two dimensions has shown 
that this yields results of the same order of accuracy as the implicit method and 
does, indeed, improve performance by a small percentage. Future work would 
include improving the two-dimensional implementation and then putting it into 



100 


the three-dimensional code. The advantages of using an explicit computation 
over an implicit one are improved code performance in both vector and parallel 
mode and reduction of the number of operations (and therefore the CPU time) 
required to compute the boundary conditions. 

The use of artificial viscosity to stabilise flow computations is another area 
needing further attention. Much research has been conducted by others into this 
topic, including Pulliam [214] and Jameson et al. [82,215]. The approach used 
here, i.e., computational experimentation to determine the damping coefficients, 
is not feasible for very large problems due to the high cost it would entail. Total 
variation diminishing (TVD) schemes, such as ENO [216,217], are used to prop- 
erly capture strong shocks and have been studied in detail by Yee and others 
[218-220]. 

The basic flow solver is not required to be the MacCormack method. If the 
explicitness of the solver is the most attractive feature, then the Runge-Kutta 
scheme of Jameson, Schmidt and Turkel [177] would be a viable alternative. If 
implicit schemes can be shown to have good performance on parallel computers, 
then one of the driving forces to maintain the explicitness of the basic flow solver 
is removed. Experimental work could be done with Beam- Warming [63] or 
another implicit method as the basic solver. 

A parallel coarse-grid acceleration scheme has been developed and tested in 
two dimensions with good results, and should be extended for the three- 
dimensional case. This would be a straightforward task. Another multigrid vari- 
ation, the fully parallel scheme, was attempted in two dimensions, but required 
underrelaxation and did not exhibit good convergence properties. It had the 
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advantage, however, of decoupling the fine and coarse grids from each other for 
each time step of the computation. Such asynchronous computation would be 
desirable for load balancing purposes. Further attention is perhaps merited. 

Residual averaging was also briefly studied during the course of this work. 
Both implicit and explicit schemes were implemented in two dimensions but not 
optimized and not included in the results here. Due to the good results obtained 
by others using such acceleration techniques, further investigation is warranted, 
followed by extension to three dimensions. 

7.3 Parallel-Processing 

A number of topics for research have arisen as a result of the multitasking 
investigation carried out here. Distributed-memory multiprocessors, parallel 
software tools, alternative partitioning strategies, and the development of new 
parallel algorithms are discussed in the following. 

The implementation of a CFD code on shared-memory parallel computers has 
been the focus of the current research. Some work is being performed to adapt 
such schemes to distributed-memory multiprocessors such as the Hypercube 
(MIMD) (by Bassett and Catherasoo [154], for example) and the Connection 
Machine (SIMD) (by Jespersen and Levit [163]). Early results show some promise 
for such architecture/algorithm combinations. An interesting task would be 
porting the three-dimensional embedded zonal multigrid scheme to such a 
machine. The spatial decomposition techniques used in the method are quite 
applicable to distributed architectures, and the explicit method allows an arbi- 
trary partitioning of the dcmain, enabling optimization of communications and 
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redistribution of the workload without algorithmic modifications. On systems 
with hypercube interconnection topologies, binary-reflected gray-code mapping 
has been demonstrated as a good means of determining the most efficient com- 
munications paths. 

The software tools provided for Fortran multitasking on the Cray-2 and X- 
MP are the macrotasking library subroutine calls and the microtasking prepro- 
cessor directives. Other extensions to Fortran for use on parallel machines have 
been developed, including the SCHEDULE package by Dongarra and Sorensen 
[168] and CoFortran by Weeks [167]. (See Appendix E for a brief description of 
the key routines of each.) 

The SCHEDULE package [168] is intended to be a short-term solution to the 
problem of the inavailability of portable multitasking software. Each parallel 
machine currently has its own language or language extensions, so that code 
developed for parallel execution on one computer must be modified to run on 
another. SCHEDULE is meant to be usable on any shared-memory multiprocess- 
ing system (provided that an interface has been written for that machine). The 
package has been implemented in the two-dimensional version of the parallel 
multigrid algorithm, which had already been macrotasked on the Cray (both the 
X-MP and Stokes). Rewriting the parallel-processing management was relatively 
straightforward except for one major drawback. SCHEDULE has no efficient 
means of repeating the same set of processes (or the same "task graph") over and 
over, as is done in an iterative method and in most CFD solvers, so that it is not 
yet sophisticated enough to be useful to the computational aerodynamicist. 
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CoFortran is another attempt at creating a portable parallel programming 
language [167]. It consists of a combination of Fortran and Fortran-like con- 
current language constructs. Like SCHEDULE, it is intended to be machine- 
independent, provided that an interface exists on the system to be used. It has 
been developed to provide the researcher with tools for the parallel solution of 
computational fluid dynamics problems, and is currently available on the X-MP 
and may soon be available on the Cray-2. It has not been tested with the algo- 
rithm of this work. 

Microtasking shows the most promise for improving the multitasking capabil- 
ities of the Cray-2, based on experience with it on the X-MP, and its incorpora- 
tion into the three-dimensional algorithm is called for. A code which has been 
macrotasked is easily converted to microtasking and should show improved per- 
formance, as has been demonstrated with the two-dimensional version of the 
multigrid algorithm on the Cray X-MP/48 [184]. 

As mentioned in the previous section, other schemes could be substituted for 
MacCormack to enhance the robustness and convergence properties of the solu- 
tion procedure. The parallel performance of these other schemes would also be 
investigated. Although explicit schemes are very straightforward to parallelize, 
they involve many iterations of small operations count. Implicit schemes, on the 
other hand, take fewer, high-operations-count iterations, and may hold greater 
opportunity, or at least another alternative, for creating large-grained tasks for 
parallelization. 

Another approach to developing parallel CFD codes, proposed by Shieh [221], 
is temporal decomposition. T t is described as follows. 
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Considering a 3D computation using an explicit or semi-implicit 
finite difference scheme, when a CPU is computing the jth grid plane 
for the nth time step, it is possible that one could proceed to compute 
the next time step, (n + l)th, for (j - 2)th grid plane with another 
CPU. Thus, for a model with m grid points in j direction excluding 
boundary points, one could utilize (m + l)/2 processors to computing 
in a pipe-line fashion... Each subtask processes the whole computa- 
tional domain. Hence, the overhead for establishing parallel tasks and 
data transfer from one task to another is minimal. 


7.4 Applications 

Another obvious direction for future research to take, since the method has 
been developed for complex flow problems, is to apply it to a realistic 
configuration. A new grid generator would be desirable and probably necessary 
for such an endeavor - perhaps Sorensen’s GRAPE code [222] or one of 
Thompson’s codes [136]. Possibilities for the application include a full 3D 
configuration, an 3D inlet or nozzle geometry, a shuttle configuration, or a 3D 
staggered blade row for a turbomachine. 

A different track to take would be to add species equations for reacting or 
dissociating gasses to the Navier-Stokes equations, resulting in a system of equa- 
tions with different behavior or stiffness characteristics, one for which the embed- 
ded multigrid scheme may be appropriate. 
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Tables 


Table J.l Characteristics of Japanese Supercomputers 


Vendor: 

Machine: 

(Year): 

Fujitsu 

VP200 

(1984) 

Hitachi 

S810/20 

(1983) 

NEC 

SX-2 

(1985) 

CPUs 

1 

1 

1 

Peak Performance 
Rate (MFLOPS) 

533 

630 

1300 

Clock period 
(nanoseconds) 

14/7 * 

14 

6 

Main memory 
(Megawords) 

8-32 

4-32 

16-32 

Secondary memory 
(Megawords) 

- 

32-128 

16-256 


* vector /scalar speed 
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Table 4-8 Characteristics of Supercomputers Used in Present Research 


Vendor: 

Machine: 

(Year): 

Control Data 
Cyber 205 
(1981) 

Cray Cray 

X-MP/1 or 2 X-MP/4 
(1982) (1984) 

Cray 

Cray-2 

(1985) 

CPUs 

1 

1 or 2 4 

4 


200/400 

233/466 932 

1952 

Clock period 
(nanoseconds) 

20.0 

9.5/8.5 

4.1 

Main memory 
(Megawords) 

4-16 

1-16 

256 

Secondary memory 
(Megawords) 

- 

32-512 

- 

Memory access 

80nsecs 

4cycles 

120/80nsecs 






Memory banks 


64 


128 




















123 






















124 


Table 5.1 Two-Dimensional Multigrid Speedups 


Test Case 

Speedup 

Inviscid Subcritical 

D 

Inviscid Supercritical 

2.1 

Laminar Viscous 

7.3 

Turbulent Viscous 

8.2 


Table 5.8 Three-Dimensional Multigrid Speedups 


Test Case 

Speedup 

Inviscid Subcritical 

4.7 

Inviscid Supercritical 

2.5 


I 


Turbulent Viscous 
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Table 5.S Two-Dimensional Embedding Speedups 
(129 x 33 finest grid, 2 embeddings) 


Test Case 

Speedup 

Inviscid Subcritical 

16.4 

Inviscid Supercritical 

n 

Turbulent Viscous 

30.2 


Table 5.4 Three-Dimensional Embedding Speedups 
(65 x 17 x 17 finest grid, 1 embedding) 


Test Case 

Speedup 

Inviscid Subcritical 

7.0 

Inviscid Supercritical 

4.6 

Turbulent Viscous 

16.0 



















126 


Table 5.5 Two-Dimensional Veetorization Speedups 


Test Case 

Complete Scheme 

Coarse-Gric 

Scheme 

Sequential 

Parallel 

Sequential 

Parallel 

Inviscid Subcritical 

■1 

3.18 

1.67 

1.80 

In viscid Supercritical 

2.89 

3.04 

1.73 

1.87 

Turbulent Viscous 

2.72 

1.86 

1.73 

1.86 
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Table 5.7 Two-Dimensional Maerotasked Parallel Coarse-Grid Scheme Performance 


Machine 

2 Processors 

4 Processors 

■1 


Speedup 


Cray X-MP 

1.78 

0.89 


0.77 


Table 5.8 Two-Dimensional Multitasked Basic Solver Performance 


Machine 

2 Processors 

4 Processors 

Speedup 


Speedup 


Cray X-MP 
with 

macrotasking 

1.91 

0.96 

3.58 

0.90 


1.93 

0.97 

3.78 

0.95 


Table 5.9 Two-Dimensional Maerotasked Multigriddcd Scheme Performance 



2 Processors 

4 Processors 

Machine 

Speedup 

mil 

Speedup 

imH 

Cray X-MP 

1.87 

0.94 

3.30 

0.83 

1 

ISSSSIflHH 

1.69 

0.84 

2.23 

0.56 

Cray 2 (AFWL) 
(2014) 

1.80 

m 

2.58 

0.65 
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Table 5.10 Three-Dimensional Mierotasked Multigridded Scheme Performance 


Machine 

2 Processors 

3 Processors 

4 Processors 

Speedup 


■1 




Cray X-MP 

1.96 

0.98 

2.83 

0.94 

3.55 

0.89 
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Table 5.12 Performance of Individual Task Groups, 
Three-Dimensional Macrotasked Multigridded Scheme 
(Sp = speedup, Ef = efficiency) 
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AERODYNAMICS 

COMPUTER SPEED AND MEMORY REQUIREMENTS 



SPEED, flops 


Figure 1.1. - Requirements for Computational Aerodynamics. 
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• - point used in predictor step 

□ - predicted point 

— - predictor step 



Figure 3.1 - MacCormack Finite-Difference Stencils 
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On«-Sttp Lax-VoBdroff Schema 
on FIbo Grid 


Coarse-Grid Sebm 


Figure 3.2 - Comparison of Fine- and Coarse-Grid Schemes 



R = Restriction from Fine to Coarse Grid 
P = Prolongation to Fine Grid 


Figure 3.3 - Sequential Multi grid 
Information Flow 


Figure 3.4 - Parallel Coarse-Grid Scheme 
Information Flow 



133 



Figure 3.5. - Example of Grid Embedding for a Two-Dimensional Problem 



Figure 3.6. - Grid Embedding for the Three-Dimensional Model Problem 
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Flow 

Section View Direction Perspective View 


i 


Figure 3.7 - Three-Dimensional Cascade 

i 



Figure 3.8 - Three-Dimensional Refinements: a. global coarse grid; 
b. first refinement; c. second refinement. 



Figure 3.9 - Three-Dimensional Grid Refinements 
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i 

b. MacCormack Time Step at Grid Points Exterior to Bold Line 
Multigrid Time Step at Points Interior to Bold Line 


Figure 3.10 - Implementation of Computational Scheme 
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c. MacCormack Time Step at Grid Points Exterior to Bold Line 
Multigrid Time Step at Points Interior to Bold Line 



d. Multigrid Time Step at All Points Shown 


Figure 3.10 - Implementation of Computational Scheme (cont’d) 
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Figure 4.1 - Memory Level Hierarchy 
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PROCESSORS 



MEMORY BANKS 


Full Crossbar Switch 



Bus 


Hypercube (4-dimensional) 


Figure 4.2 - Common Multiprocessor Interconnection Schemes 
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Figure 4.4 - Task Graph for Two-Dimensional Multigridded MacCormack Scheme 


I 


Figure 4.5 - Computational Domain with Planar Slices in the 
Lengthwise Direction 
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Figure 5.1. - Two-Dimensional Model Problems. 
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p 0 » T o* y/ u » w / u 
specified 


VISCOUS 



Tangency Condition 



Tangency No-Slip Condition 

Condition T specified 


Section View 



Po* T o» v/u » w/u 

specified 


Figure 5.2. - Three-Dimensional Model Problems. 
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49 
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Figure 5.3. - Typical Two-Dimensional Grids. 
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Figure 5.4. - Sample Three-Dimensional Grids for Inviscid and Viscous Cases. 
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Figure 5.5. - Two-Dimensional Inviscid Subcritical Flow. 
Isomach Contours. 



Figure 5.6. - Two-Dimensional Inviscid Supercritical Flow. 
Isomach Contours. 
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0 1 

Normalized u-velocity 


5.7. - Two-Dimensional Viscous Laminar Flow. 
Velocity Profiles, Re = 3.4 x 10*. 



0 1 

Normalized u-velocity 


Figure 5.8. - Two-Dimensional Viscous Turbulent Flow. 
Velocity Profiles, Re = 3.4 x 10 4 . 


Figure 5.9. - Two-Dimensional Inviscid Subcritical Row. 
Isomach Contours. (Using Embedded Grids) 



Figure 5.10. - Two-Dimensional Inviscid Supercritical Row. 
Isomach Contours. (Using Embedded Grids) 
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AXIAL COORDINATE 


Figure 5. 1 1. - Inviscid Three-Dimensional Flow. Sweep Angle = 26 Degrees. 




Figure 5.12. - Inviscid Supercritical Three-Dimensional Flow. 
Surface Isomach Contours, Unswept and Swept Blade. 
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NORMALIZED U'VELOCITt 

Laminar Flow. 


i 



0.0 10.0 


NORMAL I ZEO U-VELOCtTY 


Turbulent Flow. 

Figure 5.13. - Viscous Three-Dimensional Flow. Velocity Profiles at 50% Span, 
Re = 3.4 x 10*, Sweep Angle = 26 Degrees. 
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APPENDIX A. Generalised Coordinate Form, Equations of Motion 


The equations of motion may be written in generalized coordinate form as 
q\ + + G\ + H\ = 0 

where, for the Euler equations, 

F' = /', G' = g', W = h’ 

for the thin-layer equations, 

F' = /', G' = g\ H' = h' - Re~ l S', 

and for the full Reynolds-averaged Navier-Stokes equations, 

F' - f' - Rt x d' , G' = g' - Re~ l r', H' = h' - Re~ l s', 

The transformed variables (denoted by primes) represent the following vectors 


P 

pu 

pv 

pto 

E 


' - J 


9 = 


-i 



pU 


puU+l- t p 

r = r l 

P vU+t y p 


P vU+i t p 


(E+p)U-i t p 


• - J 


9 = 


-l 


pv “j 

puK+ti f p | 
P vV+t) y P j 

j pwV+i\ a p 
l ^(E + p)V-j) t p 


h' = J 


-l 


pW 

puW-n s p 

pvW+i y p 

pw\v+i t p 

(E+p)W-l t p 
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where the contravariant velocities are defined as 
u = + i 9 v + | f u> 


V = *1, + tl f U + + T) t 


w 


and the viscous terms are 


d' = J 


-l 


0 


0 

£* t ** ^y T *y ^f T *« 

r' = J 1 

'Ha 1 ’** ^ly^y 

£* T y* + ^>y T yy ^* T y* 

T l* T yi ^y T yy ^i T y* 

^* T « ^y T iy %x r u 


T l a T jat "Hy^jy T l* T *r 

«.P. + + «.P. _ 


+ *lA 


s' = J 


-l 


XX 


0 

{.*» + t,T„ + (,T 

£t + £ t + £ t 
** y* b y yy b * y* 

is* + 

t,P, + C,P, + «.P, 

Viscous terms for the thin-layer form of the equations, contained in S', are 


0 

|£({,’ + i’ + {,’)“{ + (l^3)({,u { + i f v ( + l.U’jK. 

M-({* + lj + l>, + (^3)(«,“ t + {,-( + {,«{){, 

(£(C I + il + t»V { + (|i/3)(C,u t + {,» { + {,>»;)«, 

[ (d + {J + {,*) [o.5|l(u J + v 2 + w\ + *Pr l (y - l)'V) t ] 

+ (^)({,“ + {,» + C, w )(i, u l + {,”{ + l, w {) ] 


S' = J 


-1 
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The terms t and (5 in the viscous flux vectors are defined as in Chapter 2, with 
any derivatives previously written in Cartesian space now expanded in 
transformed space by the chain rule. For example, u g is expanded as 


«. = 

The Jacobian of the transformation may be written as 

J 1 = + x l y k z i\ + x i\ y l z i ~ x &i z i\ ~ x i\ y i Z i ~ x l y i\ z l 

and the metric terms are 

~ VfJ *1, = •'(v'c " y t z J 

r\ y = J{x - z { s 5 ) 

= ~ Vj) *»* = J ( y i x i ~ x i y i) 


t s - AVfv z ^Vy) Mi My Mi 

= V« “ X i\) ■n* = - VU ~ - VI, 


t, = - »«*„) It = “ Ms ~ “ Ms 

For a temporally invariant grid (like the one used in the present algorithm), 
t = t , and the time-dependent derivatives in the metric terms vanish. 
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APPENDIX B. Derivation of the MacCormack Scheme from 
Taylor Series Expansion (in 1-D) 


The MacCormack scheme can be derived from the Taylor series expansion 

At 2 

q(t + A t) = q(t) + A tq t + q tt + • • ■ 

2 


{BA) 


Recalling that the equations of motion in one dimension may be represented by 

8/ 

q = - / , defining A = — , and applying Euler’s theorem on homogeneous 

8 q 

8 / 

functions [223] (which gives / = — q = Aq ), it can then be shown that 

hq 

Qtt = 

Substituting the equations for q { and q tt into (B.l) and truncating the expansion 
after the second order term yields the following: 


At 2 


q? = + —a/. 

m 


(B. 2 ) 


Discretizing the spatial derivatives in (B.2), 


n+1 n A , 
q. = q. - At 


If* - f* 1 
'•+1 *1—1 

At 2 

U (/,* +1 -2 /’+/”.,) 

2Az 

+ 

2 

Ax 2 


Since /" = Aq* and f*_ l = Aq*_ x , we can write 
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» + l A 

9, = 9, 


At 
2Ax L 


-/D + 


a (?,” - t 1 (C, -/,")) 

Ax 


(fl.3) 


At 


- a (Ct - — (/; - /; 

Ax 


-•>]] 


Defining 


9, = 9,” ~ “ (Ci “ C> A = /(*<)» and /,_ 1 = (B-4) 

Ax 


then from (B.4) and Euler’s theorem 


f, = A 9, = A 


«T - rfo. - /.") 

Ax 


(B.5) 


A-i = A 


At 

9,-i - " (/, - /,- 1) 

Ax 


(««) 


Substituting (B.5) and (B.6) into (B.3) yields 


n + 1 n 

9, = 9, 


At 

2Ax 


[(/,>■ - /,") + (7, - /'-,)] 


Thus, the two-step MacCormack scheme, derived from a Taylor series expansion, 
can be written in forward predictor - backward corrector form as 


9, = 9," ~ • (Cl - C 

Ax 


n + l n 

9, = 9, 


At 

2Az 


[(/," +1 - /,”) + (7, - /'-,)] 
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APPENDIX C. CFL Stability Limit Analysis (in 2-D) 


The inviscid stability limit for the equations of motion may be derived from 
nonconservative form of the Euler equations, 


q t + Aq t + Bq y = 0 , 
where 

« p 0 

0, u 0 

0 0a 
0 pc* 0 

If we define C and 6 such that 


0 


t; 

0 

P 

0 " 

1/p 

B = 

0 

V 

0 

0 

0 

0 

0 

V 

1/p 

u 


0 

0 

2 

pc 

t; 



C = Asina + Bsin(5 , 


sina 

cos0 = — 

V * ^ I • 2 n 
sin a + sm 3 


cosa 

and sin6 = — 

V * 2 | • 2 a 

sin a + sm (3 


then C can be rewritten as 


C — ^sin*a + sin*p 


a 

0 

0 

0 


pcosO 

psinO 

0 

a' 

0 

(1/p) COS0 

0 

a' 

(1/p) sin0 

pc*cos0 

pc*sin0 

a' 


where u' = ucosO + acos0 and c is the local speed of sound. Transforming 
the Euler equations to generalized coordinate form, 


q\ + Aq' h + Bq ^ = 0 , 
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the transformed matrix, C , is then defined as 


1 

u' 

P*! 

P* 2 

0 

C — At^ (sina/A^) 2 + (sinp/Ati) 2 

0 

u' 

0 

(1/p) k 




(1/p) * 

0 

0 

u r 


_0 

2, 

pc 

pe 2 k 2 

u' 


where u' is a function of the contravariant velocities (which are defined in 
Appendix B) 

u' = UcosQ + VsinG 

and 


fcj = (jjCosG + T) x sin0 
k 2 ~ £ y cos0 + ^ ly 8 ”* 0 

The eigenvalues of the matrix contained in the definition of C are 

= u', fju 2 = p. 3 = u' + c\/ k\ + k* , p. 4 = u' - eV* * + 

so that, for A£ = Ati = 1, the four eigenvalues of C can be written as 

r - 

u' 

u 9 

\ = AtV g i n 2 a + sin 2 p / _ 

] U' + cV^ + *2 


u' — e\/fc 2 + fcj 
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The stability condition requires that the spectral radius X be less than 1, or that 
A(=s [ I CM + I VI + + $ + 2 (5,-n, + i^j + (n’ + r,J)] 


The three-dimensional generalized-coordinate form of the stability limit, by a 
similar derivation, may be written 

A<< [ I 17 1 + I V\ + I W\ 

+ « [al + + ft + (% + \ + x) + (d + c, + ft 

+ 2 (C,(t1 x + i x ) + + iy) + T\yly + € g (*l f C # ) + *1,£, )]T 
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APPENDIX D. Generalised Coordinate Form, Artificial Dissipation 

The artificial damping used for supercritical flows may be expressed in gen- 
eralized coordinates as 

^9 v = — [( I I + ( I I ?„)„ + ( 1 C f p { I ?{){] 

«/ 

The artificial dissipation term for very fine grids in three-dimensional compu- 
tations may be expressed in generalized coordinates as 

We note that the terms C^, Cy and in (D.l) may be interpreted as 
corresponding to arclengths along which the artificial dissipation acts. These 
lengths are defined to be 
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APPEND EX E. SCHEDULE, CoFortr&n, Microtasking, and 

Macrotasking 

SCHEDULE: Fortran extensions [171] 

Static Dependency Graph 

CALL DEP - assigns data dependencies for each task 
CALL PUTQ - puts each task in the queue for execution 

Dynamic Spawning 

CALL NXTAG - automatically assigns data dependencies and forces 

parent process to wait for task completion 
CALL SPAWN - puts the process in the execution queue 

Low-Level Synchronization 

CALL LOCKON - protects code section from concurrent execution 
CALL LOCKOFF - marks the end of protected code 

CoFortran: Fortran extensions [170] 

Concurrency Statements 

CoProcess - first statement of concurrent subroutine 
CoBegin - begin parallel block of code 
CoEnd - end parallel block of code 

Synchronization Statements 

Colnit - creates CoProcess from the main program 

CoStop - signals end of parallel code management in main program 

CoStart - starts parallel execution of CoProcess 

Co Wait - synchronization point in parallel execution 

Data Passing 

SHARE - share a common block among processes 
RELEASE - unshare the common block 
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Microtasking: Preprocessor directives [152] 


CMIC$ MICRO 
CMIC$ CONTINUE 

CMIC$ DO GLOBAL 

CMIC$ PROCESS 

CMIC$ ALSO PROCESS 
CMIC$ END PROCESS 
CMIC$ STOP ALL PROCESS 


- denotes subroutine to be microtasked 

- continue parallel execution at next 

subroutine level 

• starts a parallel task for each pass through 
the following DO-loop 

- begins control structure (with ALSO and 
END PROCESS), starting a parallel task 

- starts next parallel task (ends previous task) 

- end of parallel task control structure 

- exit PROCESS/DO GLOBAL 


Macrotaaking: Fortran extensions [195] 


CALL TSKTUNE - enables tuning of processor handling 
Task Creation 

CALL TSKSTART - starts parallel tasks 
CALL TSKWAIT - synchronizes parallel tasks 


Message Passing 
CALL EYPOST 
CALL EVWAIT 
CALL EYCLEAR 


- posts an event 

- waits for an event to be posted 

- clears (or unposts) an event 


Barrier Synehronizaton 

CALL BARAS GN - creates a barrier for tasks 
CALL BARSYNC - registers arrival of task at barrier 

Low-Level Synchronization 
CALL LOCKON - protects critical region 
CALL LOCKOFF - end protection of critical region 
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